Abstract

We consider the problem of learning a set from random samples. We show how relevant geometric and topological
properties of a set can be studied analytically using concepts from the theory of reproducing kernel Hilbert
spaces. A new kind of reproducing kernel, which we call a separating kernel, plays a crucial role in our study and is
analyzed in detail. We prove a new analytic characterization of the support of a distribution, which naturally leads
to a family of provably consistent regularized learning algorithms, and we discuss the stability of these methods
with respect to random sampling. Numerical experiments show that the approach is competitive with, and often
better than, other state-of-the-art techniques.
1 Introduction



In this paper we study the problem of learning from data the set where the data probability distribution is
concentrated. Our study is more broadly motivated by questions in unsupervised learning, such as the problem of
inferring geometric properties of probability distributions from random samples.

In recent years, there has been great progress in the development of theory and algorithms for supervised
learning, i.e. function approximation problems from random noisy data [9, 21, 27, 50, 68]. On the other hand, while
there are a number of methods and studies in unsupervised learning, e.g. algorithms for clustering, dimensionality
reduction, dictionary learning (see Chapter 14 of [34]), many interesting problems remain largely unexplored.

Our analysis starts with the observation that many studies in unsupervised learning hinge on at least one of
the following two assumptions. The first is that the data are distributed according to a probability distribution
which is absolutely continuous with respect to a reference measure, such as the Lebesgue measure. In this case
it is possible to define a density and the corresponding density level sets. Studies in this scenario include
[7, 28, 40, 64], to name a few. Such an assumption prevents considering the case where the data are represented in a
high dimensional Euclidean space but are concentrated on a Lebesgue negligible subset, such as a lower dimensional
submanifold. This motivates the second assumption - sometimes called manifold assumption - postulating that the
data lie on a low dimensional Riemannian manifold embedded in a Euclidean space. This latter idea has triggered
a large number of different algorithmic and theoretical studies (see for example [4, 5, 19, 20, 36, 54]). Though the
manifold assumption has proved useful in some applications, there are many practical scenarios where it might
not be satisfied. This observation has motivated considering more general situations such as manifold plus noise
models [17, 47], and models where the data are described by combinations of more than one manifold [41, 70].

Here we consider a different point of view and work in a setting where the data are described by an abstract
probability space and a similarity function induced by a reproducing kernel [60]. In this framework, we consider
the basic problem of estimating the set where the data distribution is concentrated (see Section 1.2 for a detailed
discussion of related works). A special class of reproducing kernels, which we call separating kernels, plays a
central role in our study. First, it allows us to define a suitable metric on the probability space and makes the
support of the distribution well defined; second, it leads to a new analytical characterization of the support in
terms of the null space of the integral operator associated to the reproducing kernel.

This last result is the key towards a new computational approach to learn the support from data, since the
integral operator can be approximated with high probability from random samples [53, 60]. Estimating the null
space of the integral operator can be seen to be an ill-posed problem, and regularization techniques are needed to
obtain stable estimators. In this paper we study a class of regularization techniques proposed to solve ill-posed
problems [31] and already studied in the context of supervised learning [3, 43]. Regularization is achieved by
filtering out the small eigenvalues of the sample empirical matrix defined by the kernel. Different algorithms
are defined by different filter functions and have different computational properties. Consistency and stability
properties for a large class of spectral filters and of the corresponding algorithms are established in a unified
framework. Numerical experiments show that the proposed algorithms are competitive with, and often better than,
other state-of-the-art techniques.

The paper is divided into two parts. The first part includes Section 2, where we establish several mathematical
results relating reproducing kernel Hilbert spaces of functions on a set X and the geometry of the set X itself.
In particular, in this section we introduce the concept of separating kernel, which we further explore in Section 3.
These results are of interest in their own right, and are at the heart of our approach. In the second part of
the paper we discuss the problem of learning the support from data. More precisely, in Section 4 we illustrate
some algorithms for learning the support of a distribution from random samples, and in Section 5 we establish
consistency and stability results for them. We conclude in Sections 6 and 7 with some further discussions and
some numerical experiments, respectively. A conference version of this paper appeared in [26]. We now start by
describing in some more detail our results and discussing some related works.

1.1 Summary of main results 

In this section we briefly describe the main ideas and results in the paper. 
The setting we consider is described by a probability space $(X, p)$ and a measurable reproducing kernel K on
the set X. The data are independent and identically distributed (i.i.d.) samples $x_1, \dots, x_n$, each one drawn
from X according to p. The reproducing kernel K reflects some prior information on the problem and, as we
discuss in the following, will also define the geometry of X. The goal is to use the sample points $x_1, \dots, x_n$ to
estimate the region where the probability measure p is concentrated.

To fix some ideas, the space X can be thought of as a high-dimensional Euclidean space and the distribution p
as being concentrated on a region $X_p$, which is a smaller - and potentially lower dimensional - subset of X (e.g. a
linear subspace or a manifold). In this example, the goal is to build from data an estimator $X_n$ which is, with high
probability, close to $X_p$ with respect to a suitable metric.

We first note that a precise definition of $X_p$ requires some care. If p is assumed to have a density with respect to
some fixed reference measure (for example, the Lebesgue measure in the Euclidean space), then the region $X_p$ can
be easily defined to be the set of points where the density function is non-zero (or its closure). Nevertheless, this
assumption would prevent considering the situation where the data are concentrated on a "small", possibly lower
dimensional, subset of X. Note that, if the set X were endowed with a topological structure and p were defined
on the corresponding Borel $\sigma$-algebra, it would be natural to define $X_p$ as the support of the measure p, i.e. the
smallest closed subset of X having measure one. However, since the set X is only assumed to be a measurable
space, no a priori given topology is available. Here we also remark that the definition of $X_p$ is not the only point
where some further structure on X would be useful. Indeed, when defining a learning error, a notion of distance
between the set $X_p$ and its estimator $X_n$ is also needed and hence some metric structure on X is required.

Now, the idea is to use the properties of the reproducing kernel K to induce a metric structure - and consequently
a topology - on X. Indeed, under some mild technical assumptions on K, the function

$$d_K(x, y) = \sqrt{K(x, x) + K(y, y) - 2\,\mathrm{Re}\,K(x, y)} \qquad \forall x, y \in X$$

defines a metric on X, thus making X a topological space. Then, it is natural to define $X_p$ to be the support of p
with respect to such metric topology. Note that the metric $d_K$ also provides us with a notion of distance between
closed sets, namely the corresponding Hausdorff distance $d_H$.
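The metric induced by a kernel, and the associated Hausdorff distance, can be sketched numerically as follows. This is only an illustration under stated assumptions: the Abel kernel is used as an example of a normalized kernel, the sets are finite point clouds, and the function names (`abel_kernel`, `kernel_metric`, `hausdorff_distance`) are hypothetical helpers introduced here.

```python
import numpy as np

def abel_kernel(x, y, sigma=1.0):
    """Abel kernel exp(-||x - y|| / sigma); it satisfies K(x, x) = 1."""
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(y)) / sigma)

def kernel_metric(kernel, x, y):
    """d_K(x, y) = sqrt(K(x,x) + K(y,y) - 2 Re K(x,y))."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * np.real(kernel(x, y))
    return float(np.sqrt(max(sq, 0.0)))  # clip tiny negative round-off

def hausdorff_distance(kernel, A, B):
    """Hausdorff distance between two finite point sets with respect to d_K."""
    D = np.array([[kernel_metric(kernel, a, b) for b in B] for a in A])
    # largest distance from a point of one set to the other set, both ways
    return float(max(D.min(axis=1).max(), D.min(axis=0).max()))
```

For a normalized kernel the formula reduces to $d_K(x,y) = \sqrt{2 - 2\,\mathrm{Re}\,K(x,y)}$, so points that the kernel regards as similar are close in $d_K$.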

The problem we are interested in can now be restated in the following way: we want to learn from data an
estimator $X_n$ of $X_p$, such that $\lim_{n\to\infty} d_H(X_n, X_p) = 0$ almost surely. While $X_p$ is now well defined, it is not clear
how to build an estimator from data. A main result in the paper, given in Theorem 3, provides a new analytic
characterization of $X_p$, which immediately suggests a new computational solution for the corresponding learning
problem. To derive and state this result, we introduce a new notion of reproducing kernels, called separating
kernels, that, roughly speaking, captures the sense in which the reproducing kernel and the probability distribution
need to be related. We say that a reproducing kernel Hilbert space $\mathcal{H}$ (or equivalently its kernel) separates a subset
$C \subset X$ if, for any $x \notin C$, there exists $f \in \mathcal{H}$ such that

$$f(x) \neq 0 \quad \text{and} \quad f(y) = 0 \quad \forall y \in C.$$

If K separates all possible closed subsets in X, we say that it is completely separating. Figure 1 illustrates the notion
of separating kernel in the simple example of the linear kernel in a Euclidean space.

Now, Theorem 3 states that, if either K is completely separating, or at least separates $X_p$, then $X_p$ is the level
set of a suitable distribution dependent continuous function $F_p$. More precisely, let $\mathcal{H}$ be the reproducing kernel
Hilbert space associated to K [2], $T : \mathcal{H} \to \mathcal{H}$ the integral operator with kernel K, and denote by $T^\dagger$ its
pseudo-inverse. If we consider the function $F_p$ on X, defined by

$$F_p(x) = \langle T^\dagger T K_x, K_x \rangle \qquad \forall x \in X,$$

and K separates $X_p$, then we prove that

$$X_p = \{x \in X \mid F_p(x) = 1\}$$
(where for simplicity we are assuming $K(x, x) = 1$ for all $x \in X$).
Figure 1: The separating property is illustrated in a simple situation where $X = \mathbb{R}^2$. In the top pictures, the support
$X_p$ is a line passing through the origin and is separated by the linear kernel $K(x, y) = x^T y$: for all $x \notin X_p$, there
exists a function $f \in \mathcal{H}$ (a linear function on X) which is zero on $X_p$ and such that $f(x) \neq 0$. The pictures on the
right are a plot of the plane $y = f(x_1, x_2)$. In the bottom pictures, the support is a segment passing through the
origin. The linear kernel is too simple to separate this set: all planes are going to be zero also outside of the support
(the dotted line in the picture).

The above result is crucial since the integral operator T can be approximated with high probability from data
(see [53] and references therein). However, since the definition of $F_p$ involves the pseudo-inverse of T, the support
estimation problem is ill-posed [65] and regularization techniques are needed to ensure stability. With this in mind,
we propose and study a family of spectral regularization techniques which are classical in inverse problems [31]
and have been considered in supervised learning in [3, 43]. We define an estimator by

$$X_n = \{x \in X \mid F_n(x) \ge 1 - \tau_n\},$$

where $F_n(x) = \frac{1}{n}\,\mathbf{K}_x^*\, g_{\lambda_n}(\mathbf{K}_n/n)\, \mathbf{K}_x$, with $(\mathbf{K}_n)_{ij} = K(x_i, x_j)$, $\mathbf{K}_x$ is the column vector whose i-th entry is
$K(x_i, x)$, and $\mathbf{K}_x^*$ is its conjugate transpose. Here $g_{\lambda_n}(\mathbf{K}_n/n)$ is a matrix defined via spectral calculus by a spectral
filter function $g_{\lambda_n}$ that suppresses the contribution of the eigenvalues smaller than $\lambda_n$. Examples of spectral filters
include Tikhonov regularization and truncated singular value decomposition [43], to name only a few. The error
analysis for this class of methods can be derived in a unified framework and is done both in terms of asymptotic
convergence, and stability to random sampling by means of finite sample bounds. Indeed, we prove in Theorems
5 and 6 that, if X is compact, then

$$\lim_{n\to\infty} \sup_{x \in X} |F_p(x) - F_n(x)| = 0 \quad \text{almost surely},$$

provided that $\lim_{n\to\infty} \lambda_n = 0$ and $\sup_{n \ge 1} (L_{\lambda_n} \log n)/\sqrt{n} < +\infty$, where $L_{\lambda_n}$ is the Lipschitz constant of the function
$r_{\lambda_n}(\sigma) = \sigma g_{\lambda_n}(\sigma)$. (If X is not compact, these results hold replacing X with the intersection $X \cap C$ for any
compact subset C.) Moreover

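$$\lim_{n\to\infty} d_H(X_n, X_p) = 0 \quad \text{almost surely},$$

provided that $\lim_{n\to\infty} \tau_n = 0$ and

$$\limsup_{n\to\infty} \frac{\sup_{x \in X} |F_n(x) - F_p(x)|}{\tau_n} \le 1 \quad \text{almost surely}.$$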

Note that, if $X_p$ is separated by K, then the convergence of $F_n$ to $F_p$ can be proved without further assumptions on
the problem. On the contrary, in order to have convergence of $X_n$ to $X_p$ we need to choose a sequence $\tau_n$ satisfying
the condition above, and this requires knowledge of the convergence rate of $F_n$ to $F_p$. The latter is a property of the
couple $(p, K)$, not only of K. If the couple is such that $\sup_{x \in X} \|T^{-s/2} K_x\| < \infty$, with $0 < s < 1$, and the eigenvalues
of the (compact and positive) operator T satisfy $\sigma_j \sim j^{-1/b}$ for some $0 < b < 1$, then we prove in Theorem 7 that,
for $n \ge 1$ and $\delta > 0$, we have

$$\sup_{x \in X} |F_n(x) - F_p(x)| \le C_{s,b,\delta} \left(\frac{1}{n}\right)^{\frac{s}{2s+b+1}}$$

with probability at least $1 - 2e^{-\delta}$, for $\lambda_n = n^{-1/(2s+b+1)}$ and a suitable constant $C_{s,b,\delta}$ which does not depend on n.
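The estimator $X_n$ described above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the experiments: it assumes the Abel kernel, and it uses the standard forms of the Tikhonov filter, $g_\lambda(\sigma) = 1/(\sigma + \lambda)$, and of the truncated SVD filter, $g_\lambda(\sigma) = 1/\sigma$ for $\sigma \ge \lambda$ and $0$ otherwise. All function names are hypothetical.

```python
import numpy as np

def abel_gram(A, B, sigma=1.0):
    """Matrix of Abel kernel values exp(-||a - b|| / sigma)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-dist / sigma)

def tikhonov_filter(s, lam):
    """Tikhonov filter g_lam(s) = 1 / (s + lam)."""
    return 1.0 / (s + lam)

def tsvd_filter(s, lam):
    """Truncated SVD filter: 1/s on eigenvalues >= lam, 0 below."""
    safe = np.where(s >= lam, s, 1.0)  # avoid dividing by tiny eigenvalues
    return np.where(s >= lam, 1.0 / safe, 0.0)

def fit_F_n(samples, lam, filt=tikhonov_filter, sigma=1.0):
    """Return a function computing F_n(x) = (1/n) K_x^* g_lam(K_n / n) K_x."""
    n = len(samples)
    Kn = abel_gram(samples, samples, sigma)
    evals, evecs = np.linalg.eigh(Kn / n)
    evals = np.clip(evals, 0.0, None)            # remove negative round-off
    G = (evecs * filt(evals, lam)) @ evecs.T     # spectral calculus: g_lam(K_n / n)

    def F_n(points):
        Kx = abel_gram(samples, points, sigma)   # column j is the vector K_x for point j
        return np.einsum("ij,ik,kj->j", Kx, G, Kx) / n

    return F_n

def in_estimated_support(F_n, points, tau):
    """Membership test for X_n = {x : F_n(x) >= 1 - tau}."""
    return F_n(points) >= 1.0 - tau
```

With the Tikhonov filter this amounts to $F_n(x) = \mathbf{K}_x^* (\mathbf{K}_n + n\lambda I)^{-1} \mathbf{K}_x$, so the only costly step is one eigendecomposition (or linear solve) of the $n \times n$ kernel matrix.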

Finally, we remark that our construction relies on the assumption that the kernel K separates the support $X_p$.
The question then arises whether there exist kernels that can separate a large number of, and perhaps all, closed
subsets, namely kernels that are completely separating. The answer is affirmative, and for translation invariant
kernels on $\mathbb{R}^d$, Theorem 4 actually gives a sufficient condition for a kernel to be completely separating in terms of
its Fourier transform. As a consequence, the Abel kernel $K(x, y) = e^{-\|x - y\|/\sigma}$ on the Euclidean space $X = \mathbb{R}^d$ is
completely separating. Interestingly, the Gaussian kernel $K(x, y) = e^{-\|x - y\|^2/\sigma^2}$, which is very popular in machine
learning, is not.



1.2 Previous Work 

The problem of building an estimator $X_n$ of a subset $X_p \subset X$ which is consistent with respect to some kind
of metric among sets has been considered in seemingly diverse fields for different application purposes, from
anomaly detection - see [16] for a review - to surface estimation [55]. We give a summary of the main approaches,
with basic references for further details.

Support and Level Set Estimation. Support estimation (also called set estimation) is a part of the theory of
non-parametric statistics, where the geometry comes into play. We refer to [23, 24] for a detailed review on this topic.
Usually, the space X is $\mathbb{R}^d$ with the Euclidean metric d, and $X_p$ is the corresponding support of p. If $X_p$ is convex, a
natural estimator is the convex hull of the data $X_n = \mathrm{conv}\{x_1, \dots, x_n\}$, for which convergence rates can be derived
with respect to the Hausdorff distance [30, 51]. If $X_p$ is not convex, Devroye and Wise [28] propose the estimator

$$X_n = \bigcup_{i=1}^{n} B(x_i, \epsilon_n),$$

where $B(x, \epsilon)$ is the ball of center x and radius $\epsilon$, and $\epsilon_n$ slowly goes to zero when n tends to infinity. Consistency
and minimax convergence rates are studied in [28, 40] with respect to the distance

$$d_\mu(C_1, C_2) = \mu(C_1 \,\triangle\, C_2),$$

where $C_1 \,\triangle\, C_2 = (C_1 \setminus C_2) \cup (C_2 \setminus C_1)$ and $\mu$ is a suitable known measure.
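For comparison, membership in the union-of-balls estimator of Devroye and Wise reduces to a nearest-sample distance test. The sketch below assumes closed Euclidean balls and a user-chosen radius; the function name is hypothetical.

```python
import numpy as np

def devroye_wise_member(samples, points, eps):
    """True where a point lies in the union of closed balls B(x_i, eps)."""
    samples, points = np.atleast_2d(samples), np.atleast_2d(points)
    # pairwise Euclidean distances, shape (number of points, number of samples)
    dist = np.linalg.norm(points[:, None, :] - samples[None, :, :], axis=2)
    # a point is in the union iff its nearest sample is within eps
    return dist.min(axis=1) <= eps
```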

If p has a density f with respect to some known measure $\mu$, a traditional approach is based on a non-parametric
estimator $f_n$ of f, a so called plug-in estimator. A kernel based class of plug-in estimators is proposed in [22],
namely

$$X_n = \{x \in X \mid f_n(x) > c_n\} \quad \text{with} \quad f_n(x) = \frac{1}{n h_n^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_n}\right),$$
where $h_n$ is a regularization parameter and $c_n$ is a suitable threshold. Convergence rates with respect to $d_\mu$ are
provided in [22].

A related problem is level set estimation, where the goal is to detect the high density regions $\{x \in X \mid f(x) > c\}$.
Consistency and optimal convergence rates for different plug-in estimators

$$X_n = \{x \in X \mid f_n(x) > c\}$$

have been studied with respect to both $d_H$ and $d_\mu$, see for example [7, 66], and [58] for a slightly different approach.

One class learning algorithm. In machine learning, set estimation has been viewed as a classification problem
where we have at our disposal only positive examples. An interesting discussion on the relation between density
level set estimation, binary classification and anomaly detection is given in [64]. In this context, some algorithms
inspired by Support Vector Machine (SVM) have been studied in [56, 64, 69]. A kernel method based on kernel
principal component analysis is presented in [35] and is essentially a special case of our framework.

Manifold Learning. As we mentioned before, a setting which is of special interest is the one in which X is $\mathbb{R}^d$
and $X_p$ is a low dimensional Riemannian submanifold. In this case, the error of an estimator is usually studied in
terms of a one-sided excess functional



where d is the Euclidean metric. Some results in this framework are given in [46, 1, 44].

Computational Geometry. A classic situation, considered for example in image reconstruction problems, is when
the set $X_p$ is a hyper-surface of $\mathbb{R}^d$ and the data $x_1, \dots, x_n$ are either chosen deterministically or sampled uniformly.
The goal in this case is to find a smooth function f that gives the Cartesian equation of the hyper-surface, see for
example BUEEHSQ. 

2 Kernels, Integral Operators and Geometry in Probability Spaces 

In this section we establish the results that provide the foundations of our approach. The basic framework in this
paper is described by a triple $(X, p, K)$, where

- X is a set (endowed with a $\sigma$-algebra $\mathcal{A}_X$);

- p is a probability measure defined on X;

- K is a complex reproducing kernel on X, i.e., a complex function on $X \times X$ of positive type.

We interpret X as the data space and p as the probability distribution generating the data. Roughly speaking, 
the kernel K provides a natural similarity measure on X and it defines its geometry. 

We denote by $\mathcal{H}$ the reproducing kernel Hilbert space associated with the reproducing kernel K (we refer to
[2, 63] for an exhaustive review on the theory of reproducing kernel Hilbert spaces). The scalar product and norm
in $\mathcal{H}$ are denoted by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$, respectively. We recall that the elements of $\mathcal{H}$ are complex functions on X,
and the reproducing property $f(x) = \langle f, K_x \rangle$ holds true for all $x \in X$ and $f \in \mathcal{H}$, where $K_x \in \mathcal{H}$ is defined by
$K_x(y) = K(y, x)$. We consider complex kernels since this will make it easier to use Fourier theory in Section 3, and
moreover there is no relevant difference with the real case (every real reproducing kernel Hilbert space is naturally
embedded in its complexification, see Chapter 4 of [63]).

In order to prove our results, we need some technical conditions on K. 

Assumption 1. The kernel K has the following properties: 

a) for all $x, y \in X$ with $x \neq y$ we have $K_x \neq K_y$;

b) the associated reproducing kernel Hilbert space $\mathcal{H}$ is separable;

c) the complex function K is measurable with respect to the product $\sigma$-algebra $\mathcal{A}_X \otimes \mathcal{A}_X$;



d) for all $x \in X$, $K(x, x) = 1$.



Assumptions 1.a), 1.b) and 1.c) are minimal requirements. In particular, Assumptions 1.a) and 1.b) are needed
in order to define a separable metric structure on X, while Assumption 1.c) ensures that such metric topology is
compatible with the $\sigma$-algebra $\mathcal{A}_X$ (see Proposition 1 below). In Proposition 2, the combination of 1.a), 1.b) and 1.c)
will allow us to define the support $X_p$ of the probability measure p, as anticipated in Section 1.1. Assumption 1.d),
instead, is a normalization requirement, and could be replaced by a suitable boundedness condition (in fact, even
weaker integrability conditions could also be considered). We choose the normalization $K(x, x) = 1$ $\forall x \in X$ since
it makes equations more readable, and it is not restrictive in view of Lemma 1 below.

We now show how the above assumptions allow us to define a metric on X and to characterize the corresponding
support of p in terms of the integral operator with kernel K.

2.1 Metric induced by a kernel 

Our first result makes X a separable metric space isometrically embedded in $\mathcal{H}$. This point of view is developed
in [60]. The relation between metric spaces isometrically embedded in Hilbert spaces and kernels of positive type
was studied by Schoenberg around 1940. A recent discussion on this topic can be found in Chapter 2 § 3 of 0. 



Proposition 1. Under Assumption 1.a), the map $d_K : X \times X \to [0, +\infty[$ defined by

$$d_K(x, y) = \|K_x - K_y\| = \sqrt{K(x, x) + K(y, y) - 2\,\mathrm{Re}\,K(x, y)} \qquad (1)$$
is a metric on X. Furthermore

i) the map $\Phi : X \to \mathcal{H}$, $\Phi(x) = K_x$, is an isometry;

ii) the kernel K is a continuous function on $X \times X$, and each $f \in \mathcal{H}$ is a continuous function.

If also Assumption 1.b) is satisfied, then

iii) the metric space $(X, d_K)$ is separable.

Finally, if also Assumption 1.c) holds true, then

iv) the closed subsets of X are measurable (with respect to $\mathcal{A}_X$);

v) if Y is a topological space endowed with its Borel $\sigma$-algebra and $f : X \to Y$ is continuous, then f is measurable; in
particular, the functions in $\mathcal{H}$ are measurable.



Proof. Many of these properties are known in the literature, see for example [14, 63] and references therein. For the
reader's convenience, we give a self-contained short proof.

Assumption 1.a) states that $\Phi$ is injective. Since $d_K(x, y) = \|\Phi(x) - \Phi(y)\|$ by definition, $d_K$ is the metric on X
making $\Phi$ an isometry, as claimed in item i). Item ii) then follows from i) and the reproducing property
$K(x, y) = \langle \Phi(y), \Phi(x) \rangle$ and $f(x) = \langle f, \Phi(x) \rangle$.

If also Assumption 1.b) holds true, then the set $\Phi(X)$ is separable, and so is X. Item iii) then follows.

Suppose now that also Assumption 1.c) holds true. Then the map $d_K$ is a measurable map, so that the open balls
of X are measurable. Since X is separable, any open set is a countable union of open balls, hence it is measurable.
It follows that the closed subsets are measurable, too, hence item iv).

Let Y and f be as in item v). If $A \subset Y$ is closed, then $f^{-1}(A)$ is closed in X, hence measurable by item iv). It follows
that $f^{-1}(A)$ is measurable for all Borel sets $A \subset Y$, i.e. f is measurable. Since the elements of $\mathcal{H}$ are continuous by
ii), they are measurable, and item v) is proved. □

In the rest of the paper we will always consider X as a topological metric space with metric $d_K$. Note that $d_K$ is
the metric induced on X by the norm of $\mathcal{H}$ through the embedding $\Phi : X \to \mathcal{H}$. The next result shows that under
our assumptions we can define the set $X_p$ as the smallest closed subset of X having measure one.
Proposition 2. Under Assumptions 1.a), 1.b) and 1.c), there exists a unique closed subset $X_p \subset X$ with $p(X_p) = 1$
satisfying the following property: if C is a closed subset of X and $p(C) = 1$, then $C \supset X_p$.

Proof. Define the measurable set $X_p$ as

$$X_p = \bigcap_{\substack{C \text{ closed} \\ p(C) = 1}} C.$$

Clearly, $X_p$ is closed and measurable by Proposition 1. Since X is separable, there exists a sequence of closed
subsets $(C_j)_{j \ge 1}$ such that every closed subset $C = \bigcap_k C_{j_k}$, for some suitable subsequence. Hence,

$$X_p = \bigcap_{j \,:\, p(C_j) = 1} C_j$$

and, as a consequence, $p(X_p) = 1$. □

We add one remark. The set $X_p$ is called the support of the measure p and clearly depends both on the probability
distribution and on the topology induced by the kernel K through the metric $d_K$ on X.

2.2 Separating Kernels 

The following definition of separating kernel plays a central role in our approach. 

Definition 1. We say that the reproducing kernel Hilbert space $\mathcal{H}$ separates a subset $C \subset X$ if, for all $x \notin C$, there exists
$f \in \mathcal{H}$ such that

$$f(x) \neq 0 \quad \text{and} \quad f(y) = 0 \quad \forall y \in C. \qquad (2)$$
In this case we also say that the corresponding reproducing kernel separates C.

We add some comments. First, in (2) the function f depends on x and C. Second, the reproducing property
and (2) imply that $K_x \neq 0$ and $K_x \neq K_y$ for all $x \notin C$ and $y \in C$ (compare with Assumption 1.a)). Finally, we
stress that a different notion of separating property is given in [S3J. 

Remark 1. Given an arbitrary reproducing kernel Hilbert space $\mathcal{H}$, there exist sets that are not separated by $\mathcal{H}$. For example,
if $X = \mathbb{R}^d$ and $\mathcal{H}$ is the reproducing kernel Hilbert space with linear kernel $K(x, y) = x^T y$, the only sets separated by $\mathcal{H}$ are
the linear manifolds, that is, the sets of points defined by homogeneous linear equations (see Figure 1). A natural question is
then whether there exist kernels capable of separating large classes of subsets and in particular all the closed subsets. Section 3
answers this question positively, introducing the notion of completely separating kernels.

Next, we provide an equivalent characterization of the separating property, which will be the key to a
computational approach to support estimation. For any set C, let $P_C : \mathcal{H} \to \mathcal{H}$ be the orthogonal projection onto the
closed subspace

$$\mathcal{H}_C = \overline{\mathrm{span}}\{K_x \mid x \in C\},$$

i.e. the closure of the linear space generated by the family $\{K_x \mid x \in C\}$. Note that $P_C^2 = P_C$, $P_C^* = P_C$ and

$$\ker P_C = \{K_x \mid x \in C\}^\perp = \{f \in \mathcal{H} \mid f(x) = 0 \ \ \forall x \in C\}.$$

Moreover, define the function

$$F_C : X \to \mathbb{R}, \qquad F_C(x) = \langle P_C K_x, K_x \rangle. \qquad (3)$$
Then, we have the following theorem. 

Theorem 1. For any subset $C \subset X$, the following facts are equivalent:

i) $\mathcal{H}$ separates the set C;

ii) for all $x \notin C$, $K_x \notin \mathrm{ran}\, P_C$;

iii) $C = \{x \in X \mid F_C(x) = K(x, x)\}$;

iv) $\Phi(C) = \Phi(X) \cap \mathrm{ran}\, P_C$.

Under Assumption 1.a), if C is separated by $\mathcal{H}$, then C is closed with respect to the metric $d_K$.



Proof. We first prove that i) $\Rightarrow$ ii). Given $x \notin C$, by assumption there is $f \in \mathcal{H}$ such that $\langle f, K_x \rangle = f(x) \neq 0$,
i.e. $K_x \notin \{f\}^\perp$, and $\langle f, K_y \rangle = f(y) = 0$ for all $y \in C$, i.e. $f \in \ker P_C = (\mathrm{ran}\, P_C)^\perp$. It follows that $\mathrm{ran}\, P_C \subset \{f\}^\perp$, and
then $K_x \notin \mathrm{ran}\, P_C$.

We prove ii) $\Rightarrow$ iii). If $x \in C$, then $K_x \in \mathrm{ran}\, P_C$ by definition of $P_C$, so that $F_C(x) = K(x, x)$, hence
$C \subset \{x \in X \mid F_C(x) = K(x, x)\}$. Conversely, if $x \notin C$, we have by assumption that $(I - P_C) K_x \neq 0$, so that
$K(x, x) - F_C(x) = \|(I - P_C) K_x\|^2 \neq 0$, i.e. $C \supset \{x \in X \mid F_C(x) = K(x, x)\}$.

We prove iii) $\Rightarrow$ i). If $x \notin C$, define $f = (I - P_C) K_x \in \ker P_C$, so that $f(y) = 0$ for all $y \in C$. Furthermore,
$f(x) = K(x, x) - F_C(x) \neq 0$. Thus, f separates x from the set C.

Finally, iv) is a restatement of ii) taking into account that $K_x \in \mathrm{ran}\, P_C$ for all $x \in C$ by construction.

Under Assumption 1.a), the map $x \mapsto F_C(x) - K(x, x) = \langle P_C \Phi(x), \Phi(x) \rangle - \langle \Phi(x), \Phi(x) \rangle$ is continuous by
Proposition 1. By item iii), C is the 0-level set of this function, hence C is closed. □
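Item iii) of Theorem 1 can be checked numerically when C is a finite set, since then $P_C$ projects onto the span of finitely many $K_c$ and, for a real kernel, $F_C(x) = \mathbf{k}_C(x)^T \mathbf{K}_{CC}^{\dagger} \mathbf{k}_C(x)$, where $\mathbf{K}_{CC}$ is the kernel matrix of C and $\mathbf{k}_C(x)$ collects the values $K(c, x)$. The sketch below is only an illustration with hypothetical function names; it compares the linear kernel with the Abel kernel on two points of a segment.

```python
import numpy as np

def linear_gram(A, B):
    """Linear kernel matrix with entries a^T b."""
    return np.atleast_2d(A) @ np.atleast_2d(B).T

def abel_gram(A, B, sigma=1.0):
    """Abel kernel matrix with entries exp(-||a - b|| / sigma)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-dist / sigma)

def F_C(gram, C, points):
    """F_C(x) = <P_C K_x, K_x> for a finite set C and a real kernel."""
    K_CC = gram(C, C)
    K_Cx = gram(C, points)                 # column j holds K(c, x_j) for c in C
    K_pinv = np.linalg.pinv(K_CC, 1e-10)   # pseudo-inverse handles rank deficiency
    return np.einsum("ij,ik,kj->j", K_Cx, K_pinv, K_Cx)

C = np.array([[1.0, 1.0], [2.0, 2.0]])     # two points on a segment of the line y = x
x_on_line = np.array([[3.0, 3.0]])         # on the same line, but not in C
# Linear kernel: F_C equals K(x, x) = 18 on the whole line, so C is not separated.
print(F_C(linear_gram, C, x_on_line), linear_gram(x_on_line, x_on_line))
# Abel kernel: F_C(x) stays strictly below K(x, x) = 1 away from C.
print(F_C(abel_gram, C, x_on_line))
```

This mirrors the bottom row of Figure 1: every function in the linear-kernel space that vanishes on the segment also vanishes on the line through it.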

The next result shows that the reproducing kernel K can be normalized under the mild assumption that
$K(x, x) \neq 0$ for all $x \in X$, so that Assumption 1.d) can be satisfied up to a rescaling of K.

Lemma 1. Assume that $K(x, x) > 0$ for all $x \in X$. Then, the reproducing kernel $K'$ on X, given by

$$K'(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}} \qquad \forall x, y \in X,$$

is normalized and separates the same sets as K.

Proof. Clearly $K'$ is a kernel of positive type. Denote by $\mathcal{H}'$ the reproducing kernel Hilbert space with kernel $K'$,
and define the feature map $\Psi : X \to \mathcal{H}$, $\Psi(x) = K_x / \|K_x\|$. It is simple to check that $\langle \Psi(y), \Psi(x) \rangle = K'(x, y)$ and
$\Psi(X)^\perp = \{0\}$, so that the map $\Psi_* : \mathcal{H} \to \mathcal{H}'$,

$$(\Psi_* f)(x) = \langle f, \Psi(x) \rangle,$$
is a unitary operator with $K'_x = \Psi_*(\Psi(x))$ [14]. Clearly, for any $f \in \mathcal{H}$ and $x \in X$

$$(\Psi_* f)(x) = \langle \Psi_* f, \Psi_*(\Psi(x)) \rangle = \frac{f(x)}{\|K_x\|}.$$

The above equality shows that $\mathcal{H}$ and $\mathcal{H}'$ separate the same sets. □
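On a finite sample, the normalization of Lemma 1 is a rescaling of the kernel matrix by its diagonal. A minimal sketch, assuming a kernel matrix with strictly positive diagonal (the function name is hypothetical):

```python
import numpy as np

def normalize_gram(K):
    """Return K'(x_i, x_j) = K(x_i, x_j) / sqrt(K(x_i, x_i) K(x_j, x_j))."""
    d = np.sqrt(np.real(np.diag(K)))   # requires K(x, x) > 0 on the sample
    return K / np.outer(d, d)
```

The rescaled matrix has unit diagonal and remains of positive type, since it is obtained from K by a congruence with a positive diagonal matrix.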
2.2.1 A Special Case: Metric Spaces 

It may be the case that the set X has its own metric $d_X$, and the $\sigma$-algebra $\mathcal{A}_X$ is the Borel $\sigma$-algebra associated
with the topology induced by $d_X$. The following proposition shows that the metric $d_K$ induced by $\mathcal{H}$ is equivalent
to $d_X$ provided that $\mathcal{H}$ separates all the $d_X$-closed subsets and the corresponding kernel is continuous.

Proposition 3. Let X be a separable metric space with respect to a metric $d_X$, and $\mathcal{A}_X$ the corresponding Borel $\sigma$-algebra.
Let $\mathcal{H}$ be a reproducing kernel Hilbert space on X with kernel K. Assume that the kernel K is a continuous function with
respect to $d_X$ and that the space $\mathcal{H}$ separates every subset of X which is closed with respect to $d_X$. Then

i) Assumptions 1.a), 1.b) and 1.c) hold true, and $K(x, x) > 0$ for all $x \in X$;

ii) a set is closed with respect to $d_X$ if and only if it is closed with respect to $d_K$.

Proof. The kernel is measurable and the space $\mathcal{H}$ is separable by Proposition 5.1 and Corollary 5.2 in [14]. Since
the points are closed sets for $d_X$ and the $d_X$-closed sets are separated by $\mathcal{H}$, then $K_x \neq 0$ (i.e. $K(x, x) > 0$) for all
$x \in X$ and $K_x \neq K_y$ if $x \neq y$ by the discussion following Definition 1.

We show that $d_X$ and $d_K$ are equivalent metrics. Take a sequence $(x_j)_{j \ge 1}$ such that for some $x \in X$ it holds that
$\lim_{j\to\infty} d_X(x_j, x) = 0$. Since K is continuous with respect to $d_X$, we have $\lim_{j\to\infty} d_K(x_j, x) = 0$. Hence, the
$d_K$-closed sets are $d_X$-closed, too. Conversely, if the set C is $d_X$-closed, since $\mathcal{H}$ separates C, Theorem 1 implies that
$C = \{x \in X \mid K(x, x) - F_C(x) = 0\}$, which is a $d_K$-closed set by $d_K$-continuity of the map $x \mapsto K(x, x) - F_C(x)$. □

Item ii) of the above proposition states that the metrics $d_X$ and $d_K$ are equivalent and implies that the set $X_p$
defined in Proposition 2 coincides with the support of p with respect to the topology induced by $d_X$.

2.3 The Integral Operator Defined by the Kernel 

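We denote by $S_1$ the Banach space of the trace class operators on $\mathcal{H}$, with trace class norm

$$\|A\|_{S_1} = \mathrm{tr}\,[(A^* A)^{1/2}] = \sum_{i \in I} \langle (A^* A)^{1/2} e_i, e_i \rangle,$$

where $\{e_i\}_{i \in I}$ is any orthonormal basis of $\mathcal{H}$. Furthermore, we let $S_2$ be the separable Hilbert space of the
Hilbert-Schmidt operators on $\mathcal{H}$, with Hilbert-Schmidt norm

$$\|A\|_{S_2}^2 = \mathrm{tr}\,[A^* A] = \sum_{i \in I} \|A e_i\|^2.$$

Finally, if A is any bounded operator on $\mathcal{H}$, we denote by $\|A\|_\infty$ its uniform operator norm. It is standard that
$\|A\|_\infty \le \|A\|_{S_2} \le \|A\|_{S_1}$. Moreover, for all functions $f_1, f_2 \in \mathcal{H}$, the rank-one operator $f_1 \otimes f_2$ on $\mathcal{H}$ defined by

$$(f_1 \otimes f_2)(f) = \langle f, f_2 \rangle\, f_1 \qquad \forall f \in \mathcal{H}$$

is trace class, and $\|f_1 \otimes f_2\|_{S_1} = \|f_1 \otimes f_2\|_{S_2} = \|f_1\|\, \|f_2\|$.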

We recall a few facts on integral operators with kernel K (see [14] for proofs and further discussions). Under
Assumption 1, the $S_1$-valued map $x \mapsto K_x \otimes K_x$ is Bochner-integrable with respect to p, and its integral

$$T = \int_X K_x \otimes K_x \, dp(x) \qquad (4)$$

defines a positive trace class operator T with $\|T\|_{S_1} = \mathrm{tr}\,[T] = 1$ (a short proof is given in Proposition 12 of the
Appendix). Using the reproducing property of $\mathcal{H}$, it is straightforward to see that T is simply the integral operator
with kernel K acting on $\mathcal{H}$, i.e.

$$(Tf)(x) = \int_X K(x, y) f(y) \, dp(y) \qquad \forall f \in \mathcal{H}.$$

Since T is positive and trace class, the Hilbert-Schmidt theorem gives that

$$T = \sum_{j \in J} \sigma_j \, f_j \otimes f_j,$$

where $(f_j)_{j \in J}$ is a finite or countable orthonormal family of eigenfunctions of T corresponding to the strictly
positive eigenvalues $(\sigma_j)_{j \in J}$, and the series converges in the Banach space $S_1$ (hence in $S_2$). Note that in the above
sum each eigenvalue is repeated according to its (finite) multiplicity. As $\|T\|_{S_1} = 1$, the positive sequence $(\sigma_j)_{j \in J}$
is summable and sums up to 1.
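The empirical counterpart of this spectral picture is easy to inspect. The empirical operator $\frac{1}{n}\sum_i K_{x_i} \otimes K_{x_i}$ has the same non-zero eigenvalues as the matrix $\mathbf{K}_n/n$, so for a normalized kernel these eigenvalues are non-negative and sum to one. A minimal sketch, assuming the Abel kernel (function names are hypothetical):

```python
import numpy as np

def abel_gram(A, B, sigma=1.0):
    """Abel kernel matrix with entries exp(-||a - b|| / sigma)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-dist / sigma)

def empirical_spectrum(samples, sigma=1.0):
    """Eigenvalues of K_n / n in decreasing order."""
    n = len(samples)
    Kn = abel_gram(samples, samples, sigma)
    return np.linalg.eigvalsh(Kn / n)[::-1]   # eigvalsh returns ascending order
```

Since $K(x_i, x_i) = 1$, the trace of $\mathbf{K}_n/n$ is exactly one, matching $\mathrm{tr}\,[T] = 1$.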

The following is a key result in our approach. 

Theorem 2. Under Assumption 1, the null space of T is

$$\ker T = \{K_x \mid x \in X_p\}^\perp = \ker P_{X_p}, \qquad (5)$$
where $X_p$ is the support of p as defined in Proposition 2.

Proof. Note that, for all $f \in \mathcal{H}$, the set

$$C_f = \{x \in X \mid f(x) = 0\} = \{x \in X \mid \langle f, K_x \rangle = 0\}$$
is closed since f is continuous. By positivity of T, $f \in \ker T$ if and only if

$$\langle Tf, f \rangle = \int_X |f(x)|^2 \, dp(x) = 0,$$

which is equivalent to the condition that, for p-almost every $x \in X$, $f(x) = 0$. Hence $f \in \ker T$ if and only if
$p(C_f) = 1$, i.e. $C_f \supset X_p$, or equivalently $\langle f, K_x \rangle = 0$ $\forall x \in X_p$. Equation (5) then follows. □

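In the following, we will use the abbreviated notation $P_\rho = P_{X_\rho}$. Note that the space $\mathcal{H}$ splits into the direct sum $\mathcal{H} = \mathcal{H}_\rho \oplus \mathcal{H}_\rho^{\perp}$, where

\[ \mathcal{H}_\rho = \operatorname{ran} P_\rho = \overline{\operatorname{ran} T} = \overline{\operatorname{span}}\{K_x \mid x \in X_\rho\} = \overline{\operatorname{span}}\{f_j \mid j \in J\}, \]
\[ \mathcal{H}_\rho^{\perp} = \ker P_\rho = \ker T = \{f \in \mathcal{H} \mid f(x) = 0 \ \ \forall x \in X_\rho\}. \]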

Under Assumption 1 we also introduce the integral operator $L_K : L^2(X,\rho) \to L^2(X,\rho)$,

\[ (L_K \phi)(x) = \int_X K(x,y) \phi(y) \, d\rho(y) \qquad \forall \phi \in L^2(X,\rho), \]

which is a positive trace class operator, too. Note the difference between the operators $T$ and $L_K$: although their definitions are formally the same, the respective domains and images change. The family of eigenfunctions and eigenvalues of $L_K$ is strongly related to the family $(f_j, \sigma_j)_{j \in J}$. Indeed, as shown in [14, 53], the sequence $(\sigma_j)_{j \in J}$ coincides with the family of strictly positive eigenvalues of $L_K$ (with the same multiplicities). Furthermore, if we set

\[ \phi_j(x) = \sigma_j^{-1/2} f_j(x) \qquad \text{for almost all } x \in X, \tag{6} \]

then the family $(\phi_j)_{j \in J}$ is orthonormal in $L^2(X,\rho)$, and

\[ L_K = \sum_{j \in J} \sigma_j \, \phi_j \otimes \phi_j , \tag{7} \]

where the series converges in trace norm. Conversely, let $(\phi_j)_{j \in J}$ be an orthonormal family in $L^2(X,\rho)$ such that the decomposition (7) holds true. In general, each element $\phi_j$ is an equivalence class of functions defined $\rho$-almost everywhere. In particular, the value of $\phi_j$ is not defined outside $X_\rho$. However, in each equivalence class we can choose a unique continuous function, denoted again by $\phi_j$, which is defined at every point of $X$ by means of the extension equation

\[ \phi_j(x) = \sigma_j^{-1} \int_X K(x,y) \phi_j(y) \, d\rho(y) \qquad \forall x \in X, \tag{8} \]

see [18, 55]. With this choice, which will be implicitly assumed in the following, we have that (6) is satisfied for all $x \in X$ and $j \in J$.

2.4 An Analytic Characterization of the Support 

Let us suppose that Assumption 1 holds true. Collecting the previous results, if $\mathcal{H}$ separates $X_\rho$, then Theorem 1 gives that

\[ X_\rho = \{x \in X \mid F_{X_\rho}(x) = 1\}. \]

The function $F_{X_\rho}$ is defined by (3) in terms of the projection $P_\rho$, which, in light of Theorem 2, can be characterized using the operator $T$. Indeed, from the definition of $F_{X_\rho}$ and (5) we have

\[ F_\rho(x) = F_{X_\rho}(x) = \langle P_\rho K_x, K_x \rangle = \langle T^{\dagger} T K_x, K_x \rangle = \langle \theta(T) K_x, K_x \rangle = \sum_{j \in J} |f_j(x)|^2 , \tag{9} \]

where $T^{\dagger}$ is the pseudo-inverse of $T$ and $\theta$ is the Heaviside function $\theta(\sigma) = 1_{]0,+\infty[}(\sigma)$ (note that with our definition $\theta(0) = 0$). The above discussion is summarized in the following theorem.
Theorem 3. If $\mathcal{H}$ satisfies Assumption 1 and separates the support $X_\rho$ of the measure $\rho$, then

\[ X_\rho = \{x \in X \mid F_\rho(x) = 1\} = \{x \in X \mid \langle T^{\dagger} T K_x, K_x \rangle = 1\}. \]
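The last equality in (9) can be unpacked in two lines. The sketch below assumes only the spectral decomposition of $T$ given above and the reproducing property $f(x) = \langle f, K_x \rangle$; nothing else is used.

```latex
% theta(sigma_j) = 1 for every strictly positive eigenvalue sigma_j, hence
\theta(T) = \sum_{j \in J} \theta(\sigma_j)\, f_j \otimes f_j = \sum_{j \in J} f_j \otimes f_j ,
\qquad
\theta(T) K_x = \sum_{j \in J} \langle K_x, f_j \rangle\, f_j .
% Taking the inner product with K_x and using <f_j, K_x> = f_j(x):
\langle \theta(T) K_x, K_x \rangle
  = \sum_{j \in J} \langle K_x, f_j \rangle \langle f_j, K_x \rangle
  = \sum_{j \in J} |f_j(x)|^2 .
```

In particular $F_\rho(x) \le \|K_x\|^2$, with equality exactly when $K_x$ has no component in $\ker T$.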

As we discussed before, a natural question is whether there exist kernels capable of separating all possible closed subsets of $X$. In a learning scenario, this can be translated into a universality property, in the sense that such a kernel makes it possible to describe any probability distribution and to learn its support consistently [27]. Note that in a supervised learning framework a similar role is played by the so-called universal kernels [15, 62]. The following section answers this question in the affirmative, introducing and studying the concept of completely separating kernels. Interestingly, there are universal kernels in the sense of [15, 62] which do not separate all closed subsets of $X$, for example the Gaussian kernel.



3 Completely separating reproducing kernel Hilbert spaces 

The property defining the class of kernels we are interested in is captured by the following definition.

Definition 2 (Completely Separating Kernel). A reproducing kernel Hilbert space $\mathcal{H}$ satisfying Assumption 1a) is called completely separating if $\mathcal{H}$ separates all the subsets $C \subset X$ which are closed with respect to the metric $d_K$ defined by (1). In this case, we also say that the corresponding reproducing kernel is completely separating.

The definition of completely separating reproducing kernel Hilbert spaces should be compared with the analogous notion of complete regularity for topological spaces. Indeed, we recall that a topological space is called completely regular if, for any closed subset $C$ and any point $x \notin C$, there exists a continuous function $f$ such that $f(x) \ne 0$ and $f(y) = 0$ for all $y \in C$. As we discuss below, completely separating reproducing kernels do exist. For example, for $X = \mathbb{R}^d$ both the Abel kernel $K(x,y) = e^{-\|x-y\|/\sigma}$ and the $\ell_1$-exponential kernel $K(x,y) = e^{-\|x-y\|_1/\sigma}$ are completely separating, where $\|x\|$ is just the Euclidean norm of $x = (x_1, \dots, x_d)$ in $\mathbb{R}^d$ and $\|x\|_1 = \sum_{j=1}^d |x_j|$ is the $\ell_1$-norm. Indeed this follows from Theorem 4 and Proposition 6 below, which give sufficient conditions for a kernel to be completely separating in the case $X = \mathbb{R}^d$. Note that the Gaussian kernel $K(x,y) = e^{-\|x-y\|^2/\sigma^2}$ on $\mathbb{R}^d$ is not completely separating. This is a consequence of the following fact. It is known that the elements of the corresponding reproducing kernel Hilbert space $\mathcal{H}$ are analytic functions, see Corollary 4.44 in [63]. If $C$ is a closed subset of $\mathbb{R}^d$ with non-empty interior and $f \in \mathcal{H}$ is equal to zero on $C$, then a standard result in complex analysis implies that $f(x) = 0$ for all $x \in \mathbb{R}^d$, hence $\mathcal{H}$ does not separate $C$.

We end this section with Proposition 6, which gives a simple way to build completely separating kernels in high dimensional spaces from completely separating kernels in one dimension, the latter usually being easier to characterize.



3.1 Separating Properties of Translation Invariant Kernels 

The first result studies translation invariant kernels on $\mathbb{R}^d$, i.e. of the form $K(x,y) = K(x-y)$. We show that if the Fourier transform of the kernel satisfies a suitable growth condition, then the corresponding reproducing kernel Hilbert space is completely separating. As usual, $L^p(\mathbb{R}^d)$ denotes the space of functions on $\mathbb{R}^d$ which are $p$-integrable with respect to the Lebesgue measure $dx$, with $p \in [1,+\infty[$. If $\phi \in L^1(\mathbb{R}^d)$, its Fourier transform is the continuous bounded function $\hat\phi$ on $\mathbb{R}^d$ given by

\[ \hat\phi(z) = \int_{\mathbb{R}^d} e^{-2\pi i\, z \cdot x} \phi(x) \, dx . \]

Similarly, if $\phi \in L^2(\mathbb{R}^d)$, we denote by $\hat\phi$ its Fourier-Plancherel transform, obtained by extending the above definition from $L^1(\mathbb{R}^d) \cap L^2(\mathbb{R}^d)$ to $L^2(\mathbb{R}^d)$ by unitarity. Throughout, we assume $\mathbb{R}^d$ to be a metric space with respect to the standard metric $d_{\mathbb{R}^d}$ induced by the Euclidean norm.

We need a preliminary result characterizing a reproducing kernel Hilbert space, whose reproducing kernel is continuous and integrable, as a suitable non-closed subspace of $L^2(\mathbb{R}^d)$. The first part is a converse of Bochner's theorem (Theorem 4.18 in [32]).
Proposition 4. Let $K : \mathbb{R}^d \to \mathbb{C}$ be a continuous function in $L^1(\mathbb{R}^d)$ such that its Fourier transform $\hat K$ is strictly positive. Then the kernel $K(x,y) = K(x-y)$ is positive definite and its corresponding reproducing kernel Hilbert space $\mathcal{H}$ is

\[ \mathcal{H} = \Bigl\{ \phi \in L^2(\mathbb{R}^d) \ \Big|\ \int_{\mathbb{R}^d} \hat K(z)^{-1} |\hat\phi(z)|^2 \, dz < \infty \Bigr\} \tag{10} \]

with norm

\[ \|\phi\|_{\mathcal{H}}^2 = \int_{\mathbb{R}^d} \hat K(z)^{-1} |\hat\phi(z)|^2 \, dz \qquad \forall \phi \in \mathcal{H}. \tag{11} \]



Proof. Let $L_K : L^2(\mathbb{R}^d) \to L^2(\mathbb{R}^d)$ be the integral operator of kernel $K$, namely

\[ (L_K \phi)(x) = \int_{\mathbb{R}^d} K(x-y) \phi(y) \, dy = (K * \phi)(x), \]

which is well defined and bounded since $K \in L^1(\mathbb{R}^d)$. Since $L_K$ is a convolution operator, the Fourier transform turns it into the operator of multiplication by the bounded function $\hat K$, that is $\widehat{L_K \phi} = \hat K \hat\phi$ for all $\phi \in L^2(\mathbb{R}^d)$. It follows that

\[ \langle L_K \phi, \phi \rangle_{L^2} = \langle \hat K \hat\phi, \hat\phi \rangle_{L^2} \ge 0 \qquad \forall \phi \in L^2(\mathbb{R}^d) \]

since $\hat K > 0$ by assumption, hence $L_K$ is a positive operator. In order to show that $K$ is positive definite, pick a Dirac sequence $(\varphi_n)_{n \ge 1}$ centered at 0 and, for each $x \in \mathbb{R}^d$, define $\varphi_n^x$ by $\varphi_n^x(y) = \varphi_n(y - x)$. Given $x_1, x_2, \dots, x_N \in \mathbb{R}^d$ and $c_1, c_2, \dots, c_N \in \mathbb{C}$, set $\phi_n = \sum_{i=1}^N c_i \varphi_n^{x_i}$; then

\[ 0 \le \langle L_K \phi_n, \phi_n \rangle_{L^2} = \sum_{i,j=1}^N c_i \overline{c_j}\, \langle L_K \varphi_n^{x_i}, \varphi_n^{x_j} \rangle_{L^2} \xrightarrow[n \to \infty]{} \sum_{i,j=1}^N c_i \overline{c_j}\, K(x_j, x_i), \]

where the last step is due to continuity of $K$ and the usual properties of Dirac sequences. It follows that $\sum_{i,j=1}^N c_i \overline{c_j} K(x_j, x_i) \ge 0$, i.e. the kernel $K$ is positive definite.

Let $\mathcal{H}$ be the reproducing kernel Hilbert space associated to $K$. Since the support of the Lebesgue measure is $\mathbb{R}^d$, $L_K^{1/2}$ is a unitary isomorphism of $L^2(\mathbb{R}^d)$ onto $\mathcal{H}$ (see Proposition 6.1 in [14] and the discussion following it). Clearly $\widehat{L_K^{1/2} \phi} = \hat K^{1/2} \hat\phi$, so that (10) and (11) follow. □

We now state a sufficient condition on $K$ ensuring that $\mathcal{H}$ is completely separating.

Theorem 4. Let $K : \mathbb{R}^d \to \mathbb{C}$ be a continuous function in $L^1(\mathbb{R}^d)$ such that

\[ \hat K(z) \ge \frac{a}{(1 + b \|z\|^{\gamma_1})^{\gamma_2}} \qquad \forall z \in \mathbb{R}^d \tag{12} \]

for some $a, b, \gamma_1, \gamma_2 > 0$. Then,

i) the translation invariant kernel $K(x,y) = K(x-y)$ is positive definite and continuous;

ii) the topologies induced by the metric $d_K$ and the Euclidean metric $d_{\mathbb{R}^d}$ coincide on $\mathbb{R}^d$;

iii) the kernel $K$ is completely separating.

Proof. Condition (12) implies that $\hat K$ is strictly positive, so item i) follows from Proposition 4. In particular, from (10) we see that, if $\phi \in L^2(\mathbb{R}^d)$ and $\int_{\mathbb{R}^d} (1 + b\|z\|^{\gamma_1})^{\gamma_2} |\hat\phi(z)|^2 \, dz$ is finite, then $\phi \in \mathcal{H}$. This implies that $C_c^\infty(\mathbb{R}^d) \subset \mathcal{H}$: indeed, if $\phi \in C_c^\infty(\mathbb{R}^d)$, then $\hat\phi$ is a Schwartz function on $\mathbb{R}^d$, hence the last integral is convergent. Functions in $C_c^\infty(\mathbb{R}^d)$ separate every set $C$ which is closed with respect to the metric $d_{\mathbb{R}^d}$, hence $\mathcal{H}$ separates the $d_{\mathbb{R}^d}$-closed subsets. Items ii) and iii) then follow from Proposition 3. □

As an application, we show that the Abel kernel is completely separating. 
Proposition 5. Let

\[ K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}, \qquad K(x,y) = e^{-\|x-y\|/\sigma}, \tag{13} \]

with $\sigma > 0$. Then $K$ is a positive definite kernel and the corresponding reproducing kernel Hilbert space $\mathcal{H}$ is completely separating for all $d \ge 1$.

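Proof. A standard Fourier transform computation gives

\[ \hat K(z) = 2^d \pi^{\frac{d-1}{2}} \, \Gamma\!\left(\tfrac{d+1}{2}\right) \sigma^d \left(1 + 4\pi^2\sigma^2 \|z\|^2\right)^{-\frac{d+1}{2}}, \tag{14} \]

where $\Gamma$ is Euler's gamma function (Theorem 1.14 in ll6fl ). The claim then follows from Theorem 4. □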

Equations (10), (11) and (14) show that (up to a rescaling of the norm) the reproducing kernel Hilbert space associated to the Abel kernel (13) is just $W^{(d+1)/2}(\mathbb{R}^d)$, the Sobolev space of order $(d+1)/2$.
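To make the appeal to Theorem 4 explicit, one can match the Fourier transform of the Abel kernel against condition (12). The sketch below writes the positive prefactor as $c_{d,\sigma}$ (a shorthand introduced here, not a symbol from the paper) and uses the Fourier convention $e^{-2\pi i\, z \cdot x}$ of this section.

```latex
% Fourier transform of K(x) = e^{-\|x\|/\sigma}, with c_{d,\sigma} > 0 independent of z:
\hat K(z) = c_{d,\sigma}\,\bigl(1 + 4\pi^2\sigma^2 \|z\|^2\bigr)^{-\frac{d+1}{2}} .
% Hence (12) holds, with equality, for the choice
a = c_{d,\sigma}, \qquad b = 4\pi^2\sigma^2, \qquad \gamma_1 = 2, \qquad \gamma_2 = \tfrac{d+1}{2} .
% Inserting \hat K(z)^{-1} into (11) gives a weighted L^2 norm of Sobolev type:
\|\phi\|_{\mathcal{H}}^2
  = \frac{1}{c_{d,\sigma}} \int_{\mathbb{R}^d}
    \bigl(1 + 4\pi^2\sigma^2\|z\|^2\bigr)^{\frac{d+1}{2}} |\hat\phi(z)|^2 \, dz .
```

The weight grows like $\|z\|^{d+1}$ at infinity, which is why the space agrees with the Sobolev space of order $(d+1)/2$ up to an equivalent norm.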

3.2 Building Separating Kernels 

The following result gives a way to construct completely separating reproducing kernel Hilbert spaces on high dimensional spaces.

Proposition 6. If $X_i$, $i = 1, 2, \dots, d$, are sets and $K^{(i)}$ are completely separating reproducing kernels on $X_i$ for all $i = 1, 2, \dots, d$, then the product kernel

\[ K((x_1, \dots, x_d), (y_1, \dots, y_d)) = K^{(1)}(x_1, y_1) \cdots K^{(d)}(x_d, y_d) \]

is completely separating on the set $X = X_1 \times X_2 \times \dots \times X_d$.

Proof. The sets $X_i$ and $X$ are endowed with the metrics $d_{K^{(i)}}$ and $d_K$ induced by the corresponding kernels, and $\mathcal{H}_i$ and $\mathcal{H}$ denote the reproducing kernel Hilbert spaces with kernels $K^{(i)}$ and $K$, respectively. A standard result gives that $\mathcal{H} = \mathcal{H}_1 \otimes \dots \otimes \mathcal{H}_d$ and $K_x = K^{(1)}_{x_1} \otimes \dots \otimes K^{(d)}_{x_d}$ for all $x = (x_1, \dots, x_d) \in X$. We claim that the $d_K$-topology on $X$ is contained in the product topology of the $d_{K^{(i)}}$-topologies on $X_i$ (actually it is not difficult to show that the two topologies coincide). Indeed, if $(x_{i,k})_{k \ge 1}$ are sequences in $X_i$ such that $\lim_{k \to \infty} d_{K^{(i)}}(x_{i,k}, x_i) = 0$ for all $i = 1, \dots, d$, then

\[ \lim_{k \to \infty} d_K\bigl((x_{1,k}, \dots, x_{d,k}), (x_1, \dots, x_d)\bigr)^2 = \lim_{k \to \infty} \bigl\| K_{(x_{1,k}, \dots, x_{d,k})} - K_{(x_1, \dots, x_d)} \bigr\|^2 \]
\[ = \lim_{k \to \infty} \Bigl[ K^{(1)}(x_{1,k}, x_{1,k}) \cdots K^{(d)}(x_{d,k}, x_{d,k}) - 2 \operatorname{Re} K^{(1)}(x_{1,k}, x_1) \cdots K^{(d)}(x_{d,k}, x_d) + K^{(1)}(x_1, x_1) \cdots K^{(d)}(x_d, x_d) \Bigr] = 0, \]

since $\lim_{k \to \infty} K^{(i)}(x_{i,k}, x_{i,k}) = \lim_{k \to \infty} K^{(i)}(x_{i,k}, x_i) = K^{(i)}(x_i, x_i)$. We now prove that $\mathcal{H}$ is completely separating. If $C \subset X$ is $d_K$-closed and $x = (x_1, \dots, x_d) \in X \setminus C$, since $C$ is also closed in the product topology, for all $i = 1, \dots, d$ there exists an open neighborhood $U_i$ of $x_i$ in $X_i$ such that $U = U_1 \times \dots \times U_d \subset X \setminus C$. Since each $\mathcal{H}_i$ is completely separating, for all $i = 1, \dots, d$ there exists $f_i \in \mathcal{H}_i$ such that $f_i(x_i) \ne 0$ and $f_i(y_i) = 0$ for all $y_i \in X_i \setminus U_i$. Then the product function $f = f_1 \otimes \dots \otimes f_d$ is in $\mathcal{H}$, and satisfies $f(x) \ne 0$ and $f(y) = 0$ for all $y \in C$. □

As a consequence, the Abel kernel defined by the $\ell_1$-norm

\[ K(x,y) = e^{-\|x-y\|_1/\sigma} = \prod_{i=1}^d e^{-|x_i - y_i|/\sigma}, \qquad x = (x_1, \dots, x_d), \ y = (y_1, \dots, y_d), \]

is completely separating since each kernel in the product is positive definite and completely separating by Proposition 5.
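The product structure used in this last step can be checked numerically. The following is a minimal sketch (the function names `abel_1d`, `abel_l1` and `product_kernel` are illustrative and do not come from the paper): it evaluates the $\ell_1$ Abel kernel once directly and once as a product of one-dimensional Abel kernels, as in Proposition 6.

```python
import math

def abel_1d(s, t, sigma=1.0):
    """One-dimensional Abel kernel exp(-|s - t| / sigma)."""
    return math.exp(-abs(s - t) / sigma)

def abel_l1(x, y, sigma=1.0):
    """Abel kernel defined by the l1-norm, exp(-||x - y||_1 / sigma)."""
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-l1 / sigma)

def product_kernel(x, y, sigma=1.0):
    """Product of one-dimensional Abel kernels over the coordinates."""
    value = 1.0
    for a, b in zip(x, y):
        value *= abel_1d(a, b, sigma)
    return value

# Two arbitrary points in R^3 (illustrative values only).
x, y = (0.3, -1.2, 2.0), (1.0, 0.5, 1.5)
direct = abel_l1(x, y, sigma=0.7)
factored = product_kernel(x, y, sigma=0.7)
```

The two values agree up to floating-point rounding, which is all the factorization claims.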
4 A Spectral Approach to Learning the Support



In this section we study the set estimation problem in the context of learning theory. We fix a triple $(X, \rho, K)$ as in Section 2, and assume throughout that the reproducing kernel $K$ satisfies Assumption 1. We regard $X$ as a metric space with respect to $d_K$, and continue to denote by $X_\rho$ the support of $\rho$ defined in Proposition 2.

If $\mathcal{H}$ separates $X_\rho$, Theorem 3 shows that the support $X_\rho$ is the 1-level set of a suitable function $F_\rho$ defined by the integral operator $T$, and therefore depending on $K$ and $\rho$. However, the probability distribution $\rho$ is unknown, as we only have a set of i.i.d. points $x_1, \dots, x_n$ sampled from $\rho$ at our disposal. Our task is now to use this sample to estimate the set $X_\rho$.

The definition of $T$ given by (4) suggests that it can be estimated by the data dependent operator

\[ T_n = \frac{1}{n} \sum_{i=1}^n K_{x_i} \otimes K_{x_i}. \tag{15} \]

The operator $T_n$ is positive and has finite rank; in particular, $T_n \in \mathcal{S}_1$ and $\|T_n\|_{\mathcal{S}_1} = \operatorname{tr}[T_n] = 1$. We denote by $(\sigma_j^{(n)})_{j \in J_n}$ the strictly positive eigenvalues of $T_n$ (each one repeated according to its multiplicity) and by $(f_j^{(n)})_{j \in J_n}$ the corresponding eigenvectors; note that in the present case the index set $J_n$ is finite. However, though $T_n$ converges to $T$ in all relevant topologies (see Lemma 2 and Remark 4 below), in general $T_n^{\dagger} T_n$ does not converge to $T^{\dagger} T$, since $T^{\dagger}$ may be unbounded or, equivalently, since 0 may be an accumulation point of the spectrum of $T$ when $\dim \mathcal{H} = \infty$. Hence, the problem of support estimation is ill-posed, and regularization techniques are needed to restore well-posedness and ensure a stable solution. In the following sections, we will show that spectral regularization [3, 31, 43] can be used to learn the support efficiently from the data.

4.1 Regularized Estimators via Spectral Filtering 

An approach which is classical in inverse problems (see [31], and also [3, 43] for applications to learning) consists in replacing the pseudo-inverses $T_n^{\dagger}$ and $T^{\dagger}$ with some bounded approximations obtained by filtering out the components corresponding to the eigenvalues of $T_n$ and $T$ which are smaller than a fixed regularization parameter $\lambda$. This is achieved by introducing a suitable filter function $g_\lambda : [0,+\infty[ \to [0,+\infty[$ and replacing $T_n^{\dagger}$, $T^{\dagger}$ with the bounded operators $g_\lambda(T_n)$, $g_\lambda(T)$ defined by spectral calculus. If the function $g_\lambda$ is sufficiently regular, then convergence of $T_n$ to $T$ implies convergence of $g_\lambda(T_n)$ to $g_\lambda(T)$ in the Hilbert-Schmidt norm. On the other hand, if the regularization parameter $\lambda$ goes to zero, then $g_\lambda(T)$ converges to $T^{\dagger}$ in an appropriate sense. We are now going to apply the same idea to our setting. Since we are interested in approximating the orthogonal projection $P_\rho = T^{\dagger} T = \theta(T)$ rather than the pseudo-inverse $T^{\dagger}$, we introduce a low-pass filter $r_\lambda$, in such a way that the bounded operator $r_\lambda(T)$ is an approximation of $\theta(T)$. In terms of the previously defined function $g_\lambda$, this can be achieved by setting $r_\lambda(\sigma) = g_\lambda(\sigma)\sigma$ for all $\sigma$, so that $r_\lambda(T) = g_\lambda(T) T$. Explicitly, in terms of the spectral decompositions of $T_n$ and $T$ we have

\[ r_\lambda(T_n) = \sum_{j \in J_n} r_\lambda(\sigma_j^{(n)}) \, f_j^{(n)} \otimes f_j^{(n)}, \qquad r_\lambda(T) = \sum_{j \in J} r_\lambda(\sigma_j) \, f_j \otimes f_j . \]

Note that, since the spectra of $T_n$ and $T$ are both contained in the interval $[0,1]$, we can assume that the functions $g_\lambda$ and $r_\lambda$ are defined on $[0,1]$. Moreover, as the operators $r_\lambda(T_n)$ and $r_\lambda(T)$ approximate orthogonal projections, it is useful to have the bound $0 \le r_\lambda(T_n), r_\lambda(T) \le I$ satisfied for all $T_n$ and $T$, and this can be achieved by choosing the function $r_\lambda$ such that $0 \le r_\lambda(\sigma) \le 1$ for all $\sigma$.

As a consequence of the above discussion, the characterization of filter functions giving rise to stable algorithms 
is captured by the following assumption. 

Assumption 2. The family of functions $(r_\lambda)_{\lambda > 0}$, with $r_\lambda : [0,1] \to [0,1]$ for all $\lambda > 0$, has the following properties:

a) $r_\lambda(0) = 0$ for all $\lambda > 0$;

b) for all $\sigma > 0$, we have $\lim_{\lambda \to 0^+} r_\lambda(\sigma) = 1$;

c) for all $\lambda > 0$, there exists a positive constant $L_\lambda$ such that

\[ |r_\lambda(\sigma) - r_\lambda(\tau)| \le L_\lambda |\sigma - \tau| \qquad \forall \sigma, \tau \in [0,1]. \]

By Assumption 2a) there exists a function $g_\lambda : [0,1] \to [0,+\infty[$ such that $r_\lambda(\sigma) = g_\lambda(\sigma)\sigma$. On the other hand, by Assumption 2b) we have $\lim_{\lambda \to 0^+} r_\lambda(\sigma) = \theta(\sigma)$ for all $\sigma \in [0,1]$. Assumption 2c) is of technical nature, and its role will become clear in Section 5.2; here we note that in particular it implies that $r_\lambda$ is a continuous function for all $\lambda > 0$.

A few examples of filter functions $r_\lambda$ satisfying Assumption 2 and of corresponding functions $g_\lambda$ are given in Table 1. It is easy to check that for each of them $L_\lambda = 1/\lambda$. See [31] for further examples.



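Filter                     $r_\lambda(\sigma)$                                                                   $g_\lambda(\sigma)$

Tikhonov regularization    $\dfrac{\sigma}{\sigma + \lambda}$                                                    $\dfrac{1}{\sigma + \lambda}$

Spectral cut-off           $1_{]\lambda,+\infty[}(\sigma) + \dfrac{\sigma}{\lambda} 1_{[0,\lambda]}(\sigma)$     $\dfrac{1}{\sigma} 1_{]\lambda,+\infty[}(\sigma) + \dfrac{1}{\lambda} 1_{[0,\lambda]}(\sigma)$

Landweber filter           $\sigma \sum_{k=0}^{m} (1 - \sigma)^k$                                                $\sum_{k=0}^{m} (1 - \sigma)^k$

Table 1: Examples of filter functions satisfying Assumption 2. For the Landweber filter the regularization parameter is a natural number $m$.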
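The properties required by Assumption 2 are easy to probe numerically for these filters. The sketch below is illustrative only: the function names and the grid-based slope estimate `max_slope` are assumptions made here, and the Landweber sum is taken from $k = 0$ to $m$, following the limits shown in Table 1.

```python
def r_tikhonov(sigma, lam):
    """Tikhonov filter r(sigma) = sigma / (sigma + lam)."""
    return sigma / (sigma + lam)

def g_tikhonov(sigma, lam):
    """Companion function g(sigma) = 1 / (sigma + lam), so that r = g * sigma."""
    return 1.0 / (sigma + lam)

def r_cutoff(sigma, lam):
    """Spectral cut-off: 1 above lam, linear ramp sigma / lam on [0, lam]."""
    return 1.0 if sigma > lam else sigma / lam

def g_cutoff(sigma, lam):
    """Companion function: 1 / sigma above lam, constant 1 / lam on [0, lam]."""
    return 1.0 / sigma if sigma > lam else 1.0 / lam

def g_landweber(sigma, m):
    """Landweber: g(sigma) = sum_{k=0}^{m} (1 - sigma)^k."""
    return sum((1.0 - sigma) ** k for k in range(m + 1))

def r_landweber(sigma, m):
    """Landweber filter r(sigma) = sigma * g(sigma) = 1 - (1 - sigma)^(m + 1)."""
    return sigma * g_landweber(sigma, m)

def max_slope(r, param, grid_size=2000):
    """Largest difference quotient of r(., param) over a uniform grid of [0, 1]."""
    h = 1.0 / grid_size
    return max(abs(r((i + 1) * h, param) - r(i * h, param)) / h
               for i in range(grid_size))

lam = 0.05
slope_tikhonov = max_slope(r_tikhonov, lam)   # stays below 1 / lam
slope_cutoff = max_slope(r_cutoff, lam)       # equals 1 / lam on the ramp
```

For the Tikhonov and cut-off filters the estimated slopes do not exceed $1/\lambda$, in line with the Lipschitz constant quoted above.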

For a chosen filter, the corresponding regularized empirical estimator of $F_\rho$ is defined by

\[ F_n(x) = \langle r_{\lambda_n}(T_n) K_x, K_x \rangle = \sum_{j \in J_n} r_{\lambda_n}(\sigma_j^{(n)}) \, |f_j^{(n)}(x)|^2 , \tag{16} \]

where we allow the regularization parameter $\lambda_n$ to depend on the number of samples $n$. Note that the functions $F_n$ and $F_\rho$ are continuous on $X$ by continuity of the mapping $x \mapsto K_x$ (see Proposition 1). In Section 5 we will show that, for an appropriate choice of the sequence $(\lambda_n)_{n \ge 1}$, the estimator $F_n$ converges almost surely to $F_\rho$ uniformly on compact subsets of $X$. Unfortunately, this does not imply convergence of the 1-level sets of $F_n$ to the 1-level set of $F_\rho$ in any sense (for example, with respect to the Hausdorff distance). However, an estimator of $X_\rho$ can be obtained by setting

\[ X_n = \{x \in X \mid F_n(x) \ge 1 - \tau_n\}, \tag{17} \]

where $\tau_n > 0$ is an offset parameter that depends on the sample size $n$ (recall that $F_n$ takes values in $[0,1]$). In Section 5 we show that, for a suitable choice of the sequence $(\tau_n)_{n \ge 1}$, the set $X_n$ is indeed a consistent estimator of the support with respect to the Hausdorff distance.

In the following section we make some remarks about the computation of $F_n$.

4.2 Algorithmic and Computational Aspects 

We show that the computation of $F_n$ (hence of $X_n$) reduces to a finite dimensional problem involving the empirical kernel matrix defined by the data. To this purpose, it is useful to introduce the sampling operator

\[ S_n : \mathcal{H} \to \mathbb{C}^n, \qquad S_n f = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}, \tag{18} \]

which can be interpreted as the restriction operator which evaluates functions in $\mathcal{H}$ on the points of the training set. The adjoint of $S_n$ is

\[ S_n^* : \mathbb{C}^n \to \mathcal{H}, \qquad S_n^* \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} = \sum_{i=1}^n \alpha_i K_{x_i}, \]

and $S_n^*$ can be interpreted as the out-of-sample extension operator [18, 53]. A simple computation shows that

\[ T_n = \frac{1}{n} S_n^* S_n, \qquad S_n S_n^* = \mathbf{K}_n, \qquad (\mathbf{K}_n)_{ij} = K(x_i, x_j). \]

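Hence, considering the filter given in the form $r_\lambda(T_n) = g_\lambda(T_n) T_n$, we have

\[ r_\lambda(T_n) = g_\lambda\!\left(\frac{S_n^* S_n}{n}\right) \frac{S_n^* S_n}{n} = \frac{1}{n} S_n^* \, g_\lambda\!\left(\frac{S_n S_n^*}{n}\right) S_n = \frac{1}{n} S_n^* \, g_\lambda\!\left(\frac{\mathbf{K}_n}{n}\right) S_n , \]

where the second equality follows from spectral calculus. Using the definition of the sampling operator, we can consider the $n$-dimensional vector defined by

\[ \mathbf{K}_x = S_n K_x = \begin{pmatrix} K(x_1, x) \\ \vdots \\ K(x_n, x) \end{pmatrix}, \]

and (16) can be written as

\[ F_n(x) = \langle r_{\lambda_n}(T_n) K_x, K_x \rangle = \Bigl\langle \frac{1}{n} g_{\lambda_n}\!\left(\frac{\mathbf{K}_n}{n}\right) S_n K_x, S_n K_x \Bigr\rangle = \frac{1}{n} \mathbf{K}_x^* \, g_{\lambda_n}\!\left(\frac{\mathbf{K}_n}{n}\right) \mathbf{K}_x , \tag{19} \]

where $\mathbf{K}_x^*$ is the conjugate transpose of $\mathbf{K}_x$. More explicitly we have

\[ F_n(x) = \sum_{i=1}^n \alpha_i(x) K(x, x_i), \qquad \alpha_i(x) = \frac{1}{n} \sum_{j=1}^n \left( g_{\lambda_n}\!\left(\frac{\mathbf{K}_n}{n}\right) \right)_{ij} K(x_j, x). \tag{20} \]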

The above equation shows that, while $\mathcal{H}$ could be infinite dimensional, the computation of the estimator reduces to a finite dimensional problem. Further, though the mathematical definition of the filter is done through spectral calculus, the computations might not require performing an eigen-decomposition. As an example, for Tikhonov regularization the coefficient vector $\alpha(x)$ in (20) is given by

\[ \alpha(x) = (\mathbf{K}_n + n \lambda_n I)^{-1} \mathbf{K}_x . \]

In the case of the Landweber filter, it is possible to prove that the coefficient vector can be evaluated iteratively by setting $\alpha^0(x) = 0$, and

\[ \alpha^t(x) = \alpha^{t-1}(x) + \frac{1}{n} \bigl( \mathbf{K}_x - \mathbf{K}_n \alpha^{t-1}(x) \bigr) \]

for $t = 1, \dots, m$.

We thus see that the estimator corresponding to Tikhonov regularization can be computed via Cholesky decomposition and has complexity of order $O(n^3)$. For Landweber iteration the complexity is $O(n^2 m)$, where $m$ is the number of iterations. Finally, the spectral cut-off, or truncated SVD, requires $O(n^3)$ operations to compute the eigen-decomposition of the kernel matrix. Further discussions can be found in [43] and references therein. We end by remarking that, in order to test whether $N$ points belong to the support, we simply have to repeat the above computation replacing $\mathbf{K}_x$ by an $n \times N$ matrix $\mathbf{K}_{x,N}$, in which each column is a vector $\mathbf{K}_x$ corresponding to a point $x$ in the test set. Note that in this case the coefficients $\alpha(x)$ will also form an $n \times N$ matrix.
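The two coefficient formulas above translate directly into code. The following is a minimal pure-Python sketch, not the authors' implementation: the Abel kernel, the toy data on the unit circle, the helper names and the use of plain Gaussian elimination (instead of a Cholesky factorization) are all assumptions made here for illustration.

```python
import math

def abel(x, y, sigma=0.5):
    """Abel kernel K(x, y) = exp(-||x - y|| / sigma); note K(x, x) = 1."""
    return math.exp(-math.dist(x, y) / sigma)

def kernel_data(train, x, sigma):
    """Return the kernel matrix K_n and the vector K_x = (K(x_i, x))_i."""
    K_n = [[abel(xi, xj, sigma) for xj in train] for xi in train]
    K_x = [abel(xi, x, sigma) for xi in train]
    return K_n, K_x

def solve_spd(A, b):
    """Solve A z = b by Gaussian elimination; A is assumed symmetric positive
    definite, so elimination without pivoting is well defined."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def F_n_tikhonov(train, x, lam, sigma=0.5):
    """F_n(x) = sum_i alpha_i(x) K(x, x_i), alpha = (K_n + n*lam*I)^{-1} K_x."""
    n = len(train)
    K_n, K_x = kernel_data(train, x, sigma)
    A = [[K_n[i][j] + (n * lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve_spd(A, K_x)
    return sum(a * k for a, k in zip(alpha, K_x))

def F_n_landweber(train, x, m, sigma=0.5):
    """F_n(x) after m steps of alpha <- alpha + (K_x - K_n alpha) / n, alpha^0 = 0."""
    n = len(train)
    K_n, K_x = kernel_data(train, x, sigma)
    alpha = [0.0] * n
    for _ in range(m):
        K_alpha = [sum(K_n[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        alpha = [alpha[i] + (K_x[i] - K_alpha[i]) / n for i in range(n)]
    return sum(a * k for a, k in zip(alpha, K_x))

def in_estimated_support(F_value, tau):
    """Membership test for X_n = {x : F_n(x) >= 1 - tau}."""
    return F_value >= 1.0 - tau

# Toy data: 20 equally spaced points on the unit circle (illustrative only).
train = [(math.cos(2 * math.pi * k / 20), math.sin(2 * math.pi * k / 20))
         for k in range(20)]
on_support, far_away = train[0], (3.0, 3.0)
```

With these toy settings, a training point receives a value of $F_n$ close to 1 and a point far from the circle a value close to 0, so thresholding at $1 - \tau_n$ as in (17) separates them. A production implementation would typically vectorize the kernel evaluations and reuse a single factorization of the kernel matrix for all test points.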
5 Error Analysis: Convergence and Stability



In this section we develop an error analysis for the proposed class of estimators. First, we discuss convergence (consistency) and then stability with respect to random sampling in terms of finite sample bounds. We continue to suppose throughout this section that Assumption 1 holds true, and consider $X$ as a metric space with metric $d_K$.



5.1 Empirical data 

We recall that the empirical data are a set of i.i.d. points $x_1, \dots, x_n$, each one drawn from $X$ with probability $\rho$. Since we need to study asymptotic properties when the sample size $n$ goes to infinity, we introduce the following probability space

\[ \Omega = \{(x_i)_{i \ge 1} \mid x_i \in X \ \ \forall i \ge 1\}, \]

endowed with the product $\sigma$-algebra $\mathcal{A}_\Omega = \mathcal{A}_X \otimes \mathcal{A}_X \otimes \cdots$ and the product probability measure $P = \rho \otimes \rho \otimes \cdots$. We recall that, given a measurable space $M$ and an integer $n$, an $M$-valued estimator of size $n$ is a measurable map $\Xi_n : \Omega \to M$ depending only on the first $n$ variables, that is

\[ \Xi_n(\omega) = \xi_n(x_1, \dots, x_n), \qquad \omega = (x_i)_{i \ge 1}, \]

for some measurable map $\xi_n : X^n \to M$. The number $n$ is the cardinality of the sampled data. We then have the following facts.

Proposition 7. For all $n \ge 1$

i) $T_n$ is an $\mathcal{S}_k$-valued estimator for $k = 1, 2$;

ii) if $X$ is locally compact, then $F_n$ is a $C(X)$-valued estimator, where $C(X)$ is the space of continuous functions on $X$ with the topology of uniform convergence on compact subsets.

The proof of the above proposition is rather technical, and we refer the interested reader to Appendix A.1 for more details.

Remark 2. The assumption that $X$ is locally compact is needed to ensure that the topology of uniform convergence on compact subsets is second countable. It is always satisfied if the set $X$ is a metric space with respect to its own metric $d_X$, the topology induced by $d_X$ is locally compact and second countable, the kernel $K$ is a $d_X$-continuous function, and $K$ separates every subset of $X$ which is closed with respect to $d_X$ (see Proposition 3). If $X$ is not locally compact, then, in order to have measurability of $F_n$, one needs to replace the probability measure $P$ with the outer measure (see the discussion in [39, 67]).

Remark 3. Statisticians adopt a different notation: the data are described by a family $Y_1, Y_2, \dots$ of random variables taking values in $X$, each defined on the same probability space $(\Gamma, \mathcal{A}_\Gamma, Q)$, which are i.i.d. according to $\rho$. An $M$-valued estimator of size $n$ is then simply a random variable $\xi_n(Y_1, \dots, Y_n)$, where $\xi_n : X^n \to M$ is a measurable map. The equivalence between the two approaches is made clear by setting $(\Gamma, \mathcal{A}_\Gamma, Q) = (\Omega, \mathcal{A}_\Omega, P)$ and $Y_i(\omega) = x_i$ for all $\omega = (x_j)_{j \ge 1}$ and $i \ge 1$.

Concentration of measure results for random variables in Hilbert spaces can be used to prove that T n is an 
unbiased estimator of T, as stated in the following lemma. 

Lemma 2. For $n \ge 1$ and $\delta > 0$,

| r -T.U<2<^ (21) 

with probability at least $1 - 2e^{-\delta}$. Furthermore

\[ \lim_{n \to \infty} \frac{\sqrt{n}}{\log n} \, \|T - T_n\|_{\mathcal{S}_2} = 0 \qquad \text{almost surely.} \tag{22} \]
Proof. The result is known, but we report its short proof. For all $i \ge 1$ define the random variables $Z_i : \Omega \to \mathcal{S}_2$ as

\[ Z_i(\omega) = K_{x_i} \otimes K_{x_i}, \qquad \omega = (x_j)_{j \ge 1} \in \Omega. \]

The fact that $Z_i$ is measurable follows from Lemma 3 in A.1. Then, for all $i \ge 1$, we have $\|Z_i\|_{\mathcal{S}_2} \le 1$ almost surely, $E[Z_i] = T$, and clearly $E[\|Z_i\|_{\mathcal{S}_2}^2] \le 1$. The first result follows easily by applying Lemma 6 in A.3 and simplifying the right hand side of (41), and the second is a consequence of Lemma 7 in A.3. □



Remark 4. Note that (22) and Theorem 2.19 in [59] imply that

\[ \lim_{n \to \infty} \|T - T_n\|_{\mathcal{S}_1} = 0 \qquad \text{almost surely.} \]



5.2 Consistency 

We now choose a family of filter functions $(r_\lambda)_{\lambda > 0}$ and study the convergence of the associated estimators $F_n$ and $X_n$ introduced in Section 4.

We begin by proving convergence of the functions $F_n$ defined in (16) to the function $F_\rho$ in (9). We introduce the map $G_\lambda : X \to \mathbb{R}$ defined by

\[ G_\lambda(x) = \langle r_\lambda(T) K_x, K_x \rangle \qquad \forall x \in X, \]

which can be seen as the infinite sample analogue of $F_n$. Clearly, $G_\lambda$ is a continuous function. For all sets $C \subset X$, we then have the following splitting of the error into two parts, the sample error and the approximation error:

\[ \sup_{x \in C} |F_n(x) - F_\rho(x)| \le \underbrace{\sup_{x \in C} |F_n(x) - G_{\lambda_n}(x)|}_{\text{sample error}} + \underbrace{\sup_{x \in C} |G_{\lambda_n}(x) - F_\rho(x)|}_{\text{approximation error}} . \tag{23} \]

In order to prove consistency, we need to show that the left hand side goes to 0 as the sequence of regularization parameters $(\lambda_n)_{n \ge 1}$ tends to 0. This will be done separately for the approximation and the sample errors in the next two propositions.



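Proposition 8. Under Assumption 2b), if the sequence $(\lambda_n)_{n \ge 1}$ is such that $\lim_{n \to \infty} \lambda_n = 0$, then, for any compact subset $C \subset X$,

\[ \lim_{n \to \infty} \sup_{x \in C} |G_{\lambda_n}(x) - F_\rho(x)| = 0. \]

Proof. Assumption 2b) and $\lim_{n \to \infty} \lambda_n = 0$ imply that the sequence of non-negative functions $(r_{\lambda_n})_{n \ge 1}$ is bounded by 1 and converges pointwise to the Heaviside function $\theta$ on the interval $[0,1]$. The spectral theorem ensures that, for all $x \in C$,

\[ \lim_{n \to \infty} r_{\lambda_n}(T) K_x = \theta(T) K_x . \tag{24} \]

Given $\varepsilon > 0$, by compactness of $C$ there exists a finite covering of $C$ by balls of radius $\varepsilon$, namely $C \subset \bigcup_{i=1}^m B(x_i, \varepsilon)$. By (24) there exists $n_0$ such that

\[ \max_{i \in \{1, \dots, m\}} \|r_{\lambda_n}(T) K_{x_i} - \theta(T) K_{x_i}\| \le \varepsilon \qquad \forall n \ge n_0 . \]

Hence, for all $n \ge n_0$, we have

\[ \sup_{x \in C} |G_{\lambda_n}(x) - F_\rho(x)| = \sup_{x \in C} |\langle (r_{\lambda_n}(T) - \theta(T)) K_x, K_x \rangle| \]
\[ \le \sup_{x \in C} \|K_x\| \, \sup_{x \in C} \|(r_{\lambda_n}(T) - \theta(T)) K_x\| \]
\[ \le \max_{i \in \{1, \dots, m\}} \sup_{x \in B(x_i, \varepsilon)} \|(r_{\lambda_n}(T) - \theta(T)) K_{x_i} + (r_{\lambda_n}(T) - \theta(T))(K_x - K_{x_i})\| \]
\[ \le \max_{i \in \{1, \dots, m\}} \sup_{x \in B(x_i, \varepsilon)} \bigl( \|(r_{\lambda_n}(T) - \theta(T)) K_{x_i}\| + \|r_{\lambda_n}(T) - \theta(T)\|_\infty \, \|K_x - K_{x_i}\| \bigr) \]
\[ \le \varepsilon + \varepsilon \sup_{\sigma \in [0,1]} |r_{\lambda_n}(\sigma) - \theta(\sigma)| \le 3\varepsilon , \]

where we used $\|K_x\| \le 1$, the bound $\|K_x - K_{x_i}\| \le \varepsilon$ for all $x \in B(x_i, \varepsilon)$, which holds since $\|K_x - K_{x_i}\| = d_K(x, x_i)$, and the estimate $\sup_{\sigma \in [0,1]} |r_{\lambda_n}(\sigma) - \theta(\sigma)| \le 2$, which holds because $|r_{\lambda_n}(\sigma)| \le 1$ and $|\theta(\sigma)| \le 1$. Since $\varepsilon$ is arbitrary, the claim follows. □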



Convergence to zero of the sample error follows from (22) and the next proposition.

Proposition 9. For all sets $C \subset X$ we have

\[ \sup_{x \in C} |F_n(x) - G_{\lambda_n}(x)| \le \|r_{\lambda_n}(T_n) - r_{\lambda_n}(T)\|_{\mathcal{S}_2} . \tag{25} \]

In particular, if Assumption 2c) holds, then

\[ \sup_{x \in C} |F_n(x) - G_{\lambda_n}(x)| \le L_{\lambda_n} \|T_n - T\|_{\mathcal{S}_2} . \tag{26} \]

Proof. For all $x \in X$, we have the bound

\[ |F_n(x) - G_{\lambda_n}(x)| = |\langle (r_{\lambda_n}(T_n) - r_{\lambda_n}(T)) K_x, K_x \rangle| \le \|r_{\lambda_n}(T_n) - r_{\lambda_n}(T)\|_\infty \, \|K_x\|^2 \le \|r_{\lambda_n}(T_n) - r_{\lambda_n}(T)\|_{\mathcal{S}_2} , \]

which proves (25). Assumption 2c) and Theorem 8.1 in [8] (see also Lemma 5 in A.2 for a simple unpublished proof due to A. Maurer) imply that

\[ \|r_{\lambda_n}(T_n) - r_{\lambda_n}(T)\|_{\mathcal{S}_2} \le L_{\lambda_n} \|T_n - T\|_{\mathcal{S}_2} . \]

Inequality (26) then follows. □



The above results can be combined in the following theorem, showing that, if the sequence $\lambda_n$ is suitably chosen, then $F_n$ converges almost surely to $F_\rho$ with respect to the topology of uniform convergence on compact subsets of $X$.

Theorem 5. Under Assumption 2, if the sequence $(\lambda_n)_{n \ge 1}$ is such that

\[ \lim_{n \to \infty} \lambda_n = 0 \qquad \text{and} \qquad \sup_{n \ge 1} \frac{L_{\lambda_n} \log n}{\sqrt{n}} < +\infty , \tag{27} \]

then, for every compact subset $C \subset X$,

\[ \lim_{n \to \infty} \sup_{x \in C} |F_n(x) - F_\rho(x)| = 0 \qquad \text{almost surely.} \tag{28} \]

Proof. We show convergence to zero of both terms in the right hand side of inequality (23), thus implying (28). By (26), we have

\[ \sup_{x \in C} |F_n(x) - G_{\lambda_n}(x)| \le L_{\lambda_n} \|T_n - T\|_{\mathcal{S}_2} = \frac{L_{\lambda_n} \log n}{\sqrt{n}} \cdot \frac{\sqrt{n} \, \|T_n - T\|_{\mathcal{S}_2}}{\log n} \le M \, \frac{\sqrt{n} \, \|T_n - T\|_{\mathcal{S}_2}}{\log n} , \]

where $M = \sup_{n \ge 1} (L_{\lambda_n} \log n)/\sqrt{n}$ is finite by (27). Then (22) implies that the first term in the right hand side of inequality (23) converges to zero almost surely. Since the second term goes to zero by Proposition 8, the claim follows. □
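For the Tikhonov filter, where $L_\lambda = 1/\lambda$, condition (27) is easy to verify for polynomially decaying parameters. The following worked example is only an illustration; the choice $\lambda_n = n^{-\alpha}$ is an assumption made here, not a prescription from the text.

```latex
% Take \lambda_n = n^{-\alpha} with 0 < \alpha < 1/2, and set c = 1/2 - \alpha > 0.
% Then \lambda_n \to 0 and
\frac{L_{\lambda_n} \log n}{\sqrt{n}} = n^{\alpha - \frac{1}{2}} \log n = n^{-c} \log n .
% The function t \mapsto t^{-c} \log t on [1, +\infty[ is maximal at t = e^{1/c}, so
\sup_{n \ge 1} \frac{L_{\lambda_n} \log n}{\sqrt{n}}
  \le \frac{1}{e\,c} = \frac{1}{e\,(1/2 - \alpha)} < +\infty ,
% and both requirements in (27) are met.
```

The bound degenerates as $\alpha \to 1/2$, which reflects the fact that $\lambda_n$ may not decay faster than roughly $\log n / \sqrt{n}$.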

As already remarked above, uniform convergence of $F_n$ to $F_\rho$ on compact subsets does not imply convergence of the level sets of $F_n$ to the corresponding level sets of $F_\rho$ in any sense (for example, with respect to the Hausdorff distance among compact subsets). For this reason we are led to introduce a family of threshold parameters $(\tau_n)_{n \ge 1}$ and define the estimator $X_n$ of the set $X_\rho$ as in (17). The following result shows that for a suitable choice of the sequence $(\tau_n)_{n \ge 1}$ the Hausdorff distance between $X_n \cap C$ and $X_\rho \cap C$ goes to zero for all compact subsets $C$. Here we recall that the Hausdorff distance between two subsets $A, B \subset X$ is

\[ d_H(A, B) = \max\Bigl\{ \sup_{a \in A} d_K(a, B), \ \sup_{b \in B} d_K(b, A) \Bigr\}, \]

where $d_K(x, Y) = \inf_{y \in Y} d_K(x, y)$.
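For finite point sets the suprema and infima in this definition become maxima and minima, so the distance can be computed directly. The sketch below is illustrative: the function names are hypothetical, the Euclidean metric is used by default, and `abel_metric` shows how the kernel metric $d_K(x,y) = \|K_x - K_y\|$ looks for the Abel kernel (an assumed choice of kernel).

```python
import math

def abel_metric(x, y, sigma=1.0):
    """Kernel metric d_K(x, y) = ||K_x - K_y|| for the Abel kernel, using
    ||K_x - K_y||^2 = K(x, x) + K(y, y) - 2 K(x, y) = 2 - 2 exp(-||x - y|| / sigma)."""
    k_xy = math.exp(-math.dist(x, y) / sigma)
    return math.sqrt(max(0.0, 2.0 - 2.0 * k_xy))

def hausdorff(A, B, metric=math.dist):
    """Hausdorff distance between two finite, non-empty point sets A and B."""
    def one_sided(P, Q):
        # Largest distance from a point of P to the set Q.
        return max(min(metric(p, q) for q in Q) for p in P)
    return max(one_sided(A, B), one_sided(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (3.0, 0.0)]
d_euclid = hausdorff(A, B)                      # driven by the point (3, 0)
d_kernel = hausdorff(A, B, metric=abel_metric)  # same pair, measured in d_K
```

Because the Abel kernel metric is an increasing function of the Euclidean distance, the same pair of points realizes the Hausdorff distance in both metrics.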
Theorem 6. Under Assumption 2, if $\mathcal{H}$ separates the set $X_\rho$ and the sequence $(\lambda_n)_{n \ge 1}$ satisfies (27), then for any compact subset $C \subset X$

\[ \lim_{n \to \infty} d_H(X_n \cap C, X_\rho \cap C) = 0 \qquad \text{almost surely,} \]

provided that the threshold parameters $(\tau_n)_{n \ge 1}$ are such that

\[ \lim_{n \to \infty} \tau_n = 0, \qquad \limsup_{n \to \infty} \frac{\sup_{x \in C} |F_n(x) - F_\rho(x)|}{\tau_n} < 1 \qquad \text{almost surely.} \tag{29} \]

Proof. Without loss of generality, we may assume that $X$ itself is compact and prove the statement for $C = X$. The proof splits into two steps. First we show that

\[ \lim_{n \to \infty} \sup_{x \in X_\rho} d_K(x, X_n) = 0 . \]

Indeed, if the sequence $(\tau_n)_{n \ge 1}$ is chosen as in (29), then there exists $n_0$ such that for all $n \ge n_0$

\[ |F_n(x) - F_\rho(x)| \le \tau_n \qquad \forall x \in X . \]

If $x \in X_\rho$, then

\[ F_n(x) - 1 = F_n(x) - F_\rho(x) \ge -|F_n(x) - F_\rho(x)| \ge -\tau_n , \]

hence $x \in X_n$. Thus, $d_K(x, X_n) = 0$ for all $n \ge n_0$.

Then, we prove that

\[ \lim_{n \to \infty} \sup_{x \in X_n} d_K(x, X_\rho) = 0 \]

by contradiction. If we assume the opposite, then there exists $\varepsilon > 0$ such that for all $k$ there is $n_k \ge k$ satisfying $\sup_{x \in X_{n_k}} d_K(x, X_\rho) \ge 2\varepsilon$. Hence there is $x_k \in X_{n_k}$ such that

\[ d_K(x_k, x) \ge \varepsilon \qquad \text{for all } x \in X_\rho . \tag{30} \]

Since $X$ is compact, possibly passing to a subsequence we can assume that the sequence $(x_k)_{k \ge 1}$ converges to a limit $x_0$. We claim that $x_0 \in X_\rho$. Indeed

\[ |F_\rho(x_0) - 1| \le |F_\rho(x_0) - F_\rho(x_k)| + |F_\rho(x_k) - F_{n_k}(x_k)| + |F_{n_k}(x_k) - 1| \]
\[ \le |F_\rho(x_0) - F_\rho(x_k)| + \sup_{x \in X} |F_\rho(x) - F_{n_k}(x)| + \tau_{n_k} , \]

where $|F_{n_k}(x_k) - 1| \le \tau_{n_k}$ is due to the fact that $x_k \in X_{n_k}$, so that

\[ 1 + \tau_{n_k} \ge 1 \ge F_{n_k}(x_k) \ge 1 - \tau_{n_k} . \]

As $n_k$ goes to $\infty$, since $F_\rho$ is continuous at $x_0$, $F_{n_k}$ converges to $F_\rho$ uniformly by Theorem 5 and $\tau_{n_k}$ goes to zero, it follows that $F_\rho(x_0) = 1$, that is $x_0 \in X_\rho$. However, passing to the limit in (30) gives $d_K(x_0, x) \ge \varepsilon$ for all $x \in X_\rho$, which is the desired contradiction. □

We add some comments. First, it is not difficult to show that, if the metric space $X$ is locally compact and the kernel $K$ is such that

\[ \lim_{y\to\infty} K(y, x) = 0 \]

for all $x \in X$, as happens e.g. for the Abel kernel, then Theorems 5 and 6 also hold choosing $C = X$. Second, if $K$ does not separate $X_\rho$, the statement of the two theorems continues to be true provided that the support $X_\rho$ is replaced by the level set $\{x \in X \mid F_\rho(x) = 1\}$. Third, although the Hausdorff distance $d_H$ has been defined with respect to the metric $d_K$ induced by the kernel, if the set $X$ has its own metric $d_X$ and the hypotheses of Proposition 3 are satisfied, then Theorem 6 implies convergence of $X_n \cap C$ to $X_\rho \cap C$ also with respect to the Hausdorff distance associated to $d_X$. Finally, we remark that in Theorem 6 the convergence rate of the sequence $(\tau_n)_{n\ge 1}$ depends on the rate of convergence of $F_n$ to $F_\rho$, and in particular of $G_{\lambda_n}$ to $F_\rho$, which itself depends on some a priori assumption on $\rho$. The precise evaluation of the convergence rates of $F_n$ to $F_\rho$ for the Tikhonov filter $r_\lambda(\sigma) = \sigma/(\sigma + \lambda)$ is the subject of the next section.
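To make the notion of convergence in Theorem 6 concrete, the following minimal Python sketch computes the Hausdorff distance between two finite point sets. It assumes that the kernel metric has the form $d_K(x,y) = \|K_x - K_y\| = \sqrt{K(x,x) + K(y,y) - 2K(x,y)}$ and uses a one-dimensional Abel kernel; the function names, the kernel width and the sample points are illustrative choices and are not taken from the paper.

```python
import math

def abel_kernel(x, y, width=1.0):
    # Abel (Laplacian) kernel on the real line; `width` is an assumed parameter.
    return math.exp(-abs(x - y) / width)

def kernel_metric(x, y):
    # Assumed form of the kernel-induced metric: d_K(x, y) = ||K_x - K_y||.
    sq = abel_kernel(x, x) + abel_kernel(y, y) - 2.0 * abel_kernel(x, y)
    return math.sqrt(max(sq, 0.0))

def hausdorff(a, b, dist):
    # Largest distance from a point of one set to the other set, symmetrized.
    def directed(p, q):
        return max(min(dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))

support = [0.0, 1.0]          # hypothetical finite stand-in for the support
estimate = [0.0, 1.0, 3.0]    # hypothetical estimate with one spurious point
d_euclid = hausdorff(support, estimate, lambda x, y: abs(x - y))
d_kernel = hausdorff(support, estimate, kernel_metric)
```

In this toy example the spurious point at 3.0 alone determines both distances, which is why a vanishing Hausdorff distance rules out estimated points far from the support as well as missed parts of the support.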



5.3 Finite Sample Bounds and Stability of Random Sampling 

In order to prove stability of our algorithms under random sampling and determine their convergence rates, we need to specify suitable a priori assumptions on the class of problems to be considered. In the present section, a detailed analysis will be carried out for the case of the Tikhonov filter $r_\lambda(\sigma) = \sigma/(\sigma + \lambda)$. The techniques in [13] should allow one to derive similar results for filters other than Tikhonov.

For all $\lambda > 0$ we define

\[ \mathcal{N}(\lambda) = \operatorname{tr}\left[(T + \lambda)^{-1} T\right] = \sum_{j\in J} \frac{\sigma_j}{\sigma_j + \lambda}, \]

which is finite since $T$ is a trace class operator. The above quantity is related to the degrees of freedom of the estimator [34]. Here, we recall that $\mathcal{N}$ is a decreasing function of $\lambda$ and $\lim_{\lambda\to 0^+} \mathcal{N}(\lambda) = N$, where $N$ is the dimension of the range of $T$.
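As an illustration of the quantity $\mathcal{N}(\lambda)$, the short Python sketch below evaluates it for a finite list of eigenvalues. The spectrum used here is a made-up example with $\operatorname{tr}[T] = 1$, and the function name is hypothetical.

```python
def effective_dimension(eigenvalues, lam):
    # N(lam) = tr[(T + lam)^{-1} T] = sum_j sigma_j / (sigma_j + lam).
    if lam <= 0:
        raise ValueError("lam must be positive")
    return sum(s / (s + lam) for s in eigenvalues if s > 0)

# Hypothetical spectrum: three non-zero eigenvalues summing to tr[T] = 1.
sigmas = [0.6, 0.3, 0.1]
lams = [1.0, 0.1, 1e-3, 1e-9]
values = [effective_dimension(sigmas, lam) for lam in lams]
```

As $\lambda$ decreases the values increase towards 3, the dimension of the range in this example, while $\lambda\,\mathcal{N}(\lambda)$ never exceeds $\operatorname{tr}[T] = 1$.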

The a priori conditions we consider in the present paper are given by the following two assumptions, which involve both the reproducing kernel $K$ and the probability measure $\rho$ (compare with [12, 11]).

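Assumption 3. We assume that

a) there exist $b \in [0, 1]$ and $D_b \ge 1$ such that

\[ \sup_{\lambda > 0} \mathcal{N}(\lambda)\, \lambda^{b} \le D_b^2; \tag{31} \]

b) there exist $0 < s \le 1$ and a constant $C_s > 0$ such that $P_\rho K_x \in \operatorname{ran} T^{s/2}$ for all $x \in X$, and

\[ \sup_{x\in X} \left\| T^{-s/2} P_\rho K_x \right\|^2 \le C_s. \tag{32} \]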



The above conditions are classical in the theory of inverse problems and have been recently considered in supervised learning. Before showing how they allow us to derive a finite sample bound on the error $\sup_{x\in X} |F_n(x) - F_\rho(x)|$, we add some comments. First, Assumption 3.a is related to the level of ill-posedness of the problem [31] and can be interpreted as a condition specifying the aspect ratio of the range of $T$. Since $0 < \lambda\,\mathcal{N}(\lambda) \le \operatorname{tr}[T] = 1$, inequality (31) is always satisfied with the choice $b = 1$ and $D_1 = 1$, so that in this case we are not imposing any a priori assumption. If $\dim \operatorname{ran} T = N < \infty$, the best choice is $b = 0$ and $D_0 = \sqrt{N}$; otherwise, if $\dim \operatorname{ran} T = \infty$, then necessarily $b > 0$. In the latter case, a sufficient condition to have $b < 1$ is to assume a decay rate $\sigma_j \sim j^{-1/b}$ on the eigenvalues of $T$ (see Proposition 3 of [12]).

Coming to Assumption 3.b, first of all we remark that it is always satisfied when $\dim \operatorname{ran} T$ is finite with the choice $s = 1$ and $C_1 = \max_{j\in J} 1/\sigma_j$. In the general case, inequality (32) can be expressed by either one of the following equivalent conditions

\[ \sum_{j\in J} \sigma_j^{-s}\, |f_j(x)|^2 \le C_s \quad \forall x \in X, \qquad\qquad \sum_{j\in J} \sigma_j^{1-s}\, |\phi_j(x)|^2 \le C_s \quad \forall x \in X, \tag{33} \]

where $(f_j, \sigma_j)_{j\in J}$ and $(\phi_j, \sigma_j)_{j\in J}$ are the singular value decompositions of the operators $T$ and $L_K$, respectively, which were defined in Section 2.3 (see in particular (8) for the definition of the functions $\phi_j$ outside the set $X_\rho$). Clearly, the higher $s$ is, the stronger the assumption.

Note that in particular inequality (33) holds true if there exists a constant $\kappa > 0$ such that $\sup_{x\in X} |\phi_j(x)| \le \kappa$ for all $j \in J$, and $s \in\, ]0, 1]$ is chosen to make the series $\sum_{j\in J} \sigma_j^{1-s}$ finite.² In this case, it is quite easy to give conditions on the eigenvalues $(\sigma_j)_{j\in J}$ assuring that both Assumptions 3.a and 3.b are satisfied. For example, if $\sigma_j \sim j^{-1/b}$ for some $0 < b < 1$, then (31) holds true with this choice of $b$, and (32) is satisfied for any $0 < s < 1 - b$.

A direction of future work is to study the geometric nature of the above conditions in the case in which $X$ is a metric space, or when $X$ is a Euclidean space and $X_\rho$ a Riemannian submanifold.

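The following theorem provides the finite sample bound on the error $\sup_{x\in X} |F_n(x) - F_\rho(x)|$.

Theorem 7. Suppose $r_\lambda(\sigma) = \sigma/(\sigma + \lambda)$. If Assumption 3 holds and we choose

\[ \lambda_n = n^{-\frac{1}{2s+b+1}}, \]

then, for $n \ge 1$ and $\delta > 0$, we have

\[ \sup_{x\in X} |F_n(x) - F_\rho(x)| \le \left( C_s \vee \left( D_b\, (2\delta \vee \sqrt{2\delta}) \right) \right) \left( \frac{1}{n} \right)^{\frac{s}{2s+b+1}} \tag{34} \]

with probability at least $1 - 2e^{-\delta}$.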

We postpone the proof to the end of the current section and add here some comments. The above finite sample bound quantifies the stability of the estimator with respect to random sampling. Equivalently, if we set the right hand term of the inequality to $\epsilon$ and solve for $n = n(\epsilon, \delta)$, we obtain the sample complexity of the problem, i.e. how many samples are needed in order to achieve the maximum error $\epsilon$ with confidence $1 - 2e^{-\delta}$. As remarked before, Assumption 3.a is verified for $b = 1$ by any reproducing kernel. In this limit case our result gives a rate $n^{-s/(2s+2)}$, comparable with the one that can be obtained by inserting (26) and (35) below into inequality (23), with $\|T_n - T\|$ bounded by (21).

Note that, if $\dim \operatorname{ran} T = N < \infty$, choosing $b = 0$, $D_0 = \sqrt{N}$, $s = 1$ and $C_1 = \max_{j\in J} 1/\sigma_j$, the rate in (34) becomes $n^{-1/3}$.
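The dependence of the rate on $s$ and $b$ can be tabulated with a few lines of Python. The sketch below is only an illustration: the helper names are hypothetical, and the constants $C_s$ and $D_b$ are dropped so that only the powers of $n$ are compared.

```python
from fractions import Fraction

def rate_exponent(s, b):
    # Exponent s / (2s + b + 1) of 1/n in the bound of Theorem 7.
    s, b = Fraction(s), Fraction(b)
    return s / (2 * s + b + 1)

def tikhonov_lambda(n, s, b):
    # Regularization parameter lambda_n = n ** (-1 / (2s + b + 1)).
    return n ** (-1.0 / (2.0 * float(s) + float(b) + 1.0))

def error_terms(n, s, b):
    # Approximation term lambda^s and sample term 1 / sqrt(n * lambda^(b + 1)),
    # both without their multiplicative constants.
    lam = tikhonov_lambda(n, s, b)
    return lam ** float(s), 1.0 / (n * lam ** (float(b) + 1.0)) ** 0.5

finite_rank = rate_exponent(1, 0)                  # b = 0, s = 1
limit_case = rate_exponent(Fraction(1, 2), 1)      # b = 1, s = 1/2
approx_term, sample_term = error_terms(10_000, 0.5, 0.5)
```

With this choice of $\lambda_n$ the two terms returned by `error_terms` coincide, which is the balancing argument used in the proof at the end of the section.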

The proof of Theorem 7 follows the ideas in [12] and is based on refined estimates of the sample and approximation errors. The techniques in [13] should allow one to derive similar results for filters beyond the Tikhonov one.



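Proposition 10. If Assumption 3.a holds true, then, for $n \ge 1$ and $\delta > 0$, we have

\[ \sup_{x\in X} |F_n(x) - G_{\lambda_n}(x)| \le \frac{\delta}{n\lambda_n} + \sqrt{\frac{2\delta\, \mathcal{N}(\lambda_n)}{n\lambda_n}} \]

with probability at least $1 - 2e^{-\delta}$.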

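Proof. Consider the following decomposition

\[ r_{\lambda_n}(T) - r_{\lambda_n}(T_n) = (T + \lambda_n)^{-1} T - (T_n + \lambda_n)^{-1} T_n \]
\[ = (T + \lambda_n)^{-1} T - (T + \lambda_n)^{-1} T_n + (T + \lambda_n)^{-1} T_n - (T_n + \lambda_n)^{-1} T_n \]
\[ = (T + \lambda_n)^{-1} (T - T_n) + (T + \lambda_n)^{-1} \left[ (T_n + \lambda_n) - (T + \lambda_n) \right] (T_n + \lambda_n)^{-1} T_n \]
\[ = (T + \lambda_n)^{-1} (T - T_n) + (T + \lambda_n)^{-1} (T_n - T) (T_n + \lambda_n)^{-1} T_n \]
\[ = (T + \lambda_n)^{-1} (T - T_n) \left[ I - (T_n + \lambda_n)^{-1} T_n \right] \]
\[ = \lambda_n (T + \lambda_n)^{-1} (T - T_n) (T_n + \lambda_n)^{-1}. \]

It is easy to see that $\| (T_n + \lambda_n)^{-1} \|_\infty \le \lambda_n^{-1}$, hence

\[ \| r_{\lambda_n}(T) - r_{\lambda_n}(T_n) \|_{S_2} \le \lambda_n \| (T + \lambda_n)^{-1} (T - T_n) \|_{S_2} \| (T_n + \lambda_n)^{-1} \|_\infty \le \| (T + \lambda_n)^{-1} (T - T_n) \|_{S_2}. \]

Then, from Lemma 8 in the Appendix we have that

\[ \| (T + \lambda_n)^{-1} (T - T_n) \|_{S_2} \le \frac{\delta}{n\lambda_n} + \sqrt{\frac{2\delta\, \mathcal{N}(\lambda_n)}{n\lambda_n}} \]

with probability at least $1 - 2e^{-\delta}$, so that the result follows by (25). □

2 As happens, for example, for reproducing kernels on $X = [0, 2\pi]^d$ which are invariant under translations, when $\rho$ is the Lebesgue measure on $[0, 2\pi]^d$.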
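The operator identity on which the proof of Proposition 10 rests is purely algebraic and can be checked numerically. The following Python sketch does so for two hypothetical symmetric $2\times 2$ matrices standing in for $T$ and $T_n$; the matrices, the value of $\lambda$ and the helper names are assumptions made only for the illustration.

```python
def inv2(a):
    # Inverse of a 2x2 matrix given as nested lists.
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def mul(a, b):
    # Product of two 2x2 matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sub(a, b):
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]

def shift(a, lam):
    # a + lam * I
    return [[a[i][j] + (lam if i == j else 0.0) for j in range(2)]
            for i in range(2)]

def tikhonov(a, lam):
    # r_lam(a) = (a + lam)^{-1} a
    return mul(inv2(shift(a, lam)), a)

T = [[0.7, 0.1], [0.1, 0.3]]      # hypothetical population operator
T_n = [[0.6, 0.2], [0.2, 0.4]]    # hypothetical empirical operator
lam = 0.05

lhs = sub(tikhonov(T, lam), tikhonov(T_n, lam))
rhs = mul(mul(inv2(shift(T, lam)), sub(T, T_n)), inv2(shift(T_n, lam)))
rhs = [[lam * v for v in row] for row in rhs]
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
```

The two sides agree up to rounding error, and no commutation between $T$ and $T_n$ is needed, in line with the derivation above.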



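Proposition 11. If Assumption 3.b holds true, then

\[ \sup_{x\in X} |G_\lambda(x) - F_\rho(x)| \le \lambda^{s} C_s. \tag{35} \]

Proof. Since $\theta(\sigma) - r_\lambda(\sigma) = \lambda/(\sigma + \lambda)$ for all $\sigma > 0$, we have

\[ |G_\lambda(x) - F_\rho(x)| = \left| \langle (r_\lambda(T) - \theta(T)) K_x, K_x \rangle \right| = \left| \langle (r_\lambda(T) - \theta(T)) P_\rho K_x, P_\rho K_x \rangle \right| = \lambda \left\| (T + \lambda)^{-\frac{1}{2}} P_\rho K_x \right\|^2, \]

as $P_\rho K_x \in \ker T^{\perp}$. Since by assumption $P_\rho K_x \in \operatorname{ran} T^{s/2}$ for some $0 < s \le 1$, spectral calculus and the bound $\sigma^{s}/(\sigma + \lambda) \le \lambda^{s-1}$ give the inequality

\[ \left\| (T + \lambda)^{-\frac{1}{2}} P_\rho K_x \right\|^2 = \left\| \left[ (T + \lambda)^{-\frac{1}{2}} T^{\frac{s}{2}} \right] T^{-\frac{s}{2}} P_\rho K_x \right\|^2 \le \lambda^{s-1} \left\| T^{-\frac{s}{2}} P_\rho K_x \right\|^2, \]

so that

\[ |G_\lambda(x) - F_\rho(x)| \le \lambda^{s} \left\| T^{-\frac{s}{2}} P_\rho K_x \right\|^2 \le \lambda^{s} C_s \]

for all $x \in X$. □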
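The scalar bound $\sigma^{s}/(\sigma + \lambda) \le \lambda^{s-1}$ used in the proof of Proposition 11 can be spot-checked on a grid. The Python sketch below is a numerical sanity check under arbitrarily chosen grid values, not a proof; the function name is hypothetical.

```python
def bias_factor(sigma, lam, s):
    # sigma^s / (sigma + lam), the quantity bounded by lam^(s - 1) in the proof.
    return sigma ** s / (sigma + lam)

grid = [10.0 ** (-k) for k in range(0, 7)]   # sigma and lambda values in (0, 1]
exponents = (0.25, 0.5, 0.75, 1.0)
worst = max(
    bias_factor(sigma, lam, s) / lam ** (s - 1.0)
    for sigma in grid for lam in grid for s in exponents
)
```

The ratio stays below 1 on the whole grid, as expected from the weighted arithmetic-geometric mean inequality $\sigma^{s}\lambda^{1-s} \le s\sigma + (1-s)\lambda \le \sigma + \lambda$.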

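We are now ready to prove the main result.

Proof of Theorem 7. The choice $\lambda_n = n^{-1/(2s+b+1)}$ is the one that sets the contributions of the sample and approximation errors in (23) to be equal. Indeed, we begin by simplifying the bound on the sample error. If $\lambda \ge n^{-1}$, then $n\lambda \ge \sqrt{n\lambda^{b+1}}$ for all $0 \le b \le 1$, so that

\[ \frac{\delta}{n\lambda} + \sqrt{\frac{2\delta\, \mathcal{N}(\lambda)}{n\lambda}} \le \frac{\delta}{n\lambda} + \frac{D_b \sqrt{2\delta}}{\sqrt{n\lambda^{b+1}}} \le (\delta \vee \sqrt{2\delta}) \left( \frac{1}{n\lambda} + \frac{D_b}{\sqrt{n\lambda^{b+1}}} \right) \le \frac{2 D_b\, (\delta \vee \sqrt{2\delta})}{\sqrt{n\lambda^{b+1}}}, \]

where we used the definition of $D_b$ (and the fact that $D_b \ge 1$). Then, by the above inequality and Propositions 10 and 11, inequality (23) gives

\[ \sup_{x\in X} |F_n(x) - F_\rho(x)| \le C_s \lambda^{s} + \frac{2 D_b\, (\delta \vee \sqrt{2\delta})}{\sqrt{n\lambda^{b+1}}}. \tag{36} \]

If we set the contributions of the sample and approximation errors to be equal, the choice for $\lambda$ is

\[ \lambda_n = \left( \frac{1}{n} \right)^{\frac{1}{2s+b+1}}. \]

It is easy to see that $\lambda_n \ge n^{-1}$ for all values of $s$, $b$, so that from (36) we have

\[ \sup_{x\in X} |F_n(x) - F_\rho(x)| \le \left( C_s \vee \left( 2 D_b\, (\delta \vee \sqrt{2\delta}) \right) \right) \left( \frac{1}{n} \right)^{\frac{s}{2s+b+1}}. \]

□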
5.4 The kernel PCA filter

A natural choice for the spectral filter $r_\lambda$ would be the regularization defined by kernel PCA [57], which corresponds to truncating the generalized inverse of the kernel matrix at some cutoff parameter $\lambda$. The corresponding filter function is

\[ r_\lambda(\sigma) = \begin{cases} 1 & \sigma \ge \lambda \\ 0 & \sigma < \lambda. \end{cases} \]

The above filter does not satisfy the Lipschitz condition 2.c in Assumption 2, so that the bound (26) for the sample error $\sup_{x\in X} |F_n(x) - G_{\lambda_n}(x)|$ does not hold in this case.³ However, we can still achieve an estimate by employing inequality (38) in A.2. To this aim, with a slight abuse of notation, here we count the eigenvalues of $T$ and $T_n$ without their multiplicities and we list them in decreasing order. Furthermore, for any $\lambda > 0$ we set $\sigma_{j(\lambda)}$ and $\sigma^{(n)}_{k(\lambda)}$ as the smallest eigenvalues of $T$ and $T_n$ which are greater than or equal to $\lambda$, i.e.

\[ \sigma_1 > \sigma_2 > \dots > \sigma_{j(\lambda)} \ge \lambda > \sigma_{j(\lambda)+1}, \]
\[ \sigma^{(n)}_1 > \sigma^{(n)}_2 > \dots > \sigma^{(n)}_{k(\lambda)} \ge \lambda > \sigma^{(n)}_{k(\lambda)+1}. \]

Inequality (38) implies that

\[ \| r_\lambda(T_n) - r_\lambda(T) \|_{S_2} \le \frac{\| T_n - T \|_{S_2}}{\min\{ \sigma_{j(\lambda)} - \sigma^{(n)}_{k(\lambda)+1},\ \sigma^{(n)}_{k(\lambda)} - \sigma_{j(\lambda)+1} \}} \]

and inequality (25) for the sample error then reads

\[ \sup_{x\in C} |F_n(x) - G_{\lambda_n}(x)| \le \frac{\| T_n - T \|_{S_2}}{\min\{ \sigma_{j(\lambda_n)} - \sigma^{(n)}_{k(\lambda_n)+1},\ \sigma^{(n)}_{k(\lambda_n)} - \sigma_{j(\lambda_n)+1} \}} \le \frac{\| T_n - T \|_{S_2}}{\min\{ \sigma_{j(\lambda_n)} - \lambda_n,\ \lambda_n - \sigma_{j(\lambda_n)+1} \}}. \]

By Lemma 2, in order to have convergence to 0 of the right hand side of this expression we need to choose the sequence $(\lambda_n)_{n\ge 1}$ such that

\[ \sup_{n\ge 1} \frac{\log n}{\sqrt{n}\, \min\{ \sigma_{j(\lambda_n)} - \lambda_n,\ \lambda_n - \sigma_{j(\lambda_n)+1} \}} < \infty. \]

Since the gap $\sigma_{j(\lambda)} - \sigma_{j(\lambda)+1}$ can have any arbitrary rate of convergence to zero as $\lambda \to 0^+$, we thus see that there exists no distribution independent choice of $(\lambda_n)_{n\ge 1}$ ensuring the convergence to zero of the above bound.

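Note that $r_\lambda(T)$ is the projection $P_{j(\lambda)}$ onto the sum of the eigenspaces of the first $j(\lambda)$ eigenvalues of $T$ and $r_\lambda(T_n)$ is the projection $P^{(n)}_{k(\lambda)}$ onto the sum of the eigenspaces of the first $k(\lambda)$ eigenvalues of $T_n$. If $(M_n)_{n\ge 1}$ is any strictly increasing sequence with $M_n \in \mathbb{N}$ for all $n$, we can consider the following distribution dependent choice $\lambda_n = (\sigma_{M_n} + \sigma_{M_n+1})/2$. Then we have

\[ \left\| P^{(n)}_{k(\lambda_n)} - P_{M_n} \right\|_{S_2} = \| r_{\lambda_n}(T_n) - r_{\lambda_n}(T) \|_{S_2} \le \frac{2\, \| T_n - T \|_{S_2}}{\sigma_{M_n} - \sigma_{M_n+1}}, \]

which recovers a known result about kernel PCA (see for example [71]). Furthermore, if we have that $\| T_n - T \|_{S_2} < (\sigma_{M_n} - \sigma_{M_n+1})/2$, then we obtain

\[ \left\| P^{(n)}_{k(\lambda_n)} - P_{M_n} \right\|_{S_\infty} < 1, \quad \text{hence} \quad \dim \operatorname{ran} P^{(n)}_{k(\lambda_n)} = \dim \operatorname{ran} P_{M_n}. \]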
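The projection bound above can be illustrated on a small example. The Python sketch below applies the kernel PCA filter to two hypothetical symmetric $2\times 2$ matrices standing in for $T$ and $T_n$, with the cutoff placed at the midpoint between the two eigenvalues of $T$ (that is, $M_n = 1$); all names and numbers are assumptions made for the illustration.

```python
import math

def eig_sym2(a):
    # Eigenvalues (largest first) of a symmetric 2x2 matrix.
    mean = (a[0][0] + a[1][1]) / 2.0
    dev = math.hypot((a[0][0] - a[1][1]) / 2.0, a[0][1])
    return mean + dev, mean - dev

def pca_filter(a, lam):
    # r_lam(a): orthogonal projection onto the eigenspaces with eigenvalue >= lam.
    top, low = eig_sym2(a)
    if low >= lam:
        return [[1.0, 0.0], [0.0, 1.0]]
    if top < lam:
        return [[0.0, 0.0], [0.0, 0.0]]
    # Here top >= lam > low, so the top eigenvalue is simple.
    return [[(a[i][j] - (low if i == j else 0.0)) / (top - low)
             for j in range(2)] for i in range(2)]

def hs_norm(a):
    # Hilbert-Schmidt (Frobenius) norm.
    return math.sqrt(sum(v * v for row in a for v in row))

def diff(a, b):
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]

T = [[0.7, 0.1], [0.1, 0.3]]        # hypothetical population operator
T_n = [[0.65, 0.15], [0.15, 0.35]]  # hypothetical empirical operator
sigma_1, sigma_2 = eig_sym2(T)
lam = (sigma_1 + sigma_2) / 2.0     # midpoint cutoff, M_n = 1
gap = sigma_1 - sigma_2

lhs = hs_norm(diff(pca_filter(T_n, lam), pca_filter(T, lam)))
bound = 2.0 * hs_norm(diff(T_n, T)) / gap
```

In this example the perturbation is smaller than half the spectral gap, so the two projections have the same rank and their distance is below the bound.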



The following result extends Theorem 5 to the case of kernel PCA, at the price of having a distribution dependent choice of the cut-off sequence $(M_n)_{n\ge 1}$.

Theorem 8. If the sequence of natural numbers $(M_n)_{n\ge 1}$ is strictly increasing and such that

\[ \sup_{n\ge 1} \frac{\log n}{\sqrt{n}\, (\sigma_{M_n} - \sigma_{M_n+1})} < +\infty \]

and we define the sequence $(\lambda_n)_{n\ge 1}$ as

\[ \lambda_n = \frac{\sigma_{M_n} + \sigma_{M_n+1}}{2}, \]

then, for every compact subset $C \subset X$,

\[ \lim_{n\to\infty} \sup_{x\in C} |F_n(x) - F_\rho(x)| = 0 \quad \text{almost surely.} \]

Proof. By the above discussion and inequality (25),

\[ \sup_{x\in C} |F_n(x) - G_{\lambda_n}(x)| \le \frac{2\, \| T_n - T \|_{S_2}}{\sigma_{M_n} - \sigma_{M_n+1}} \le \frac{\sqrt{n}\, \| T_n - T \|_{S_2}}{\log n}\ \sup_{n\ge 1} \frac{2 \log n}{\sqrt{n}\, (\sigma_{M_n} - \sigma_{M_n+1})}. \]

Convergence to 0 of the sample error then follows from (22). Combining this fact and Proposition 8 into inequality (23), the claim then follows. □

3 Note that, by Proposition 14 in A.1, if $X$ is locally compact, then $F_n$ defined in (16) still is a $C(X)$-valued estimator.



6 Some Perspectives 

In this section we discuss some different perspectives on our approach and suggest some possible extensions.

6.1 Connection to Mercer Theorem

We start by discussing some connections between our analytical characterization of the support of $\rho$ and Mercer theorem [45]. With the notations of Section 2.3, the fact that the family $(f_j)_{j\in J}$ is an orthonormal basis of $P_\rho \mathcal{H}$, the reproducing property and (6) give the relation

\[ \langle P_\rho K_y, K_x \rangle = \sum_{j\in J} f_j(x)\, \overline{f_j(y)} = \sum_{j\in J} \sigma_j\, \phi_j(x)\, \overline{\phi_j(y)} \qquad \forall x, y \in X, \tag{37} \]

where the series converges absolutely. Note that in this expression the eigenfunctions $\phi_j$ of $L_K$ are defined outside $X_\rho$ through the extension equation (8). Restricting (37) to $x, y \in X_\rho$, we obtain

\[ K(x, y) = \sum_{j\in J} \sigma_j\, \phi_j(x)\, \overline{\phi_j(y)} \qquad \forall x, y \in X_\rho, \]

which is Mercer theorem [63]. In particular, for $x = y$ we have $\sum_{j\in J} \sigma_j |\phi_j(x)|^2 = K(x, x)$ for all $x \in X_\rho$. On the other hand, the assumption that the reproducing kernel separates $X_\rho$ precisely ensures that

\[ \sum_{j\in J} \sigma_j |\phi_j(x)|^2 \ne K(x, x) \quad \text{for all } x \notin X_\rho. \]

(Recall that, if $K$ separates $X_\rho$, then $X_\rho$ is the 1-level set of the function $F_\rho = \sum_{j\in J} \sigma_j |\phi_j|^2$.)
6.2 A Feature Space Point of View 

In machine learning, kernel methods are often described in terms of a corresponding feature map [68]. This point of view highlights the linear structure of the Hilbert space and often provides a more geometric interpretation.

We recall that a feature map associated to a reproducing kernel is a map $\Psi : X \to \mathcal{F}$, where $\mathcal{F}$ is a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$, satisfying $K(x, y) = \langle \Psi(y), \Psi(x) \rangle_{\mathcal{F}}$. While every map $\Psi$ from $X$ into a Hilbert space $\mathcal{F}$ defines a reproducing kernel, it is also possible to prove that each kernel has an associated feature map (and in fact many). Indeed, given $K$, the natural assignment is $\mathcal{F} = \mathcal{H}$ and $\Psi(x) = \Phi(x) = K_x$. Such a choice is also minimal, in the sense that, if we make a different choice of $\mathcal{F}$ and $\Psi$, then there exists an isometry $W : \mathcal{H} \to \mathcal{F}$ such that $\Psi(x) = W K_x$ for all $x \in X$.

We next review some of the concepts introduced in Section 2 in terms of feature maps. For the sake of comparison we assume that $\| \Psi(x) \|_{\mathcal{F}} = 1$ for all $x \in X$ (this corresponds to the normalization assumption 1.d), we let $\mathcal{F}_C$ be the closure of the linear span of the set $\{ \Psi(x) \mid x \in C \}$, and define

\[ d_{\mathcal{F}}(\Psi(x), \mathcal{F}_C) = \inf_{f \in \mathcal{F}_C} \| \Psi(x) - f \|_{\mathcal{F}}. \]

It is easy to see that the definition of separating kernel has the following equivalent and natural analogue in the context of feature maps.

Definition 3. We say that a feature map $\Psi$ separates a subset $C \subset X$ if

\[ d_{\mathcal{F}}(\Psi(x), \mathcal{F}_C) = 0 \iff x \in C. \]

The above definition is equivalent to Definition 1 since $d_{\mathcal{F}}(\Psi(x), \mathcal{F}_C) = \| \Psi(x) - Q_C \Psi(x) \|_{\mathcal{F}}$, where $Q_C$ is the orthogonal projection onto $\mathcal{F}_C$. Then, according to Definition 3, a point $x \in C$ if and only if $\| \Psi(x) - Q_C \Psi(x) \|_{\mathcal{F}} = 0$. Since $\Psi(x) = W K_x$ for all $x \in X$ and $Q_C W = W P_C$, this is equivalent to

\[ 0 = \| \Psi(x) - Q_C \Psi(x) \|^2_{\mathcal{F}} = \| K_x - P_C K_x \|^2_{\mathcal{H}} = K(x, x) - F_C(x). \]

Theorem 1 then implies that Definitions 1 and 3 are equivalent. We thus see that the separating property has a clear geometric interpretation in the feature space: the set $\Psi(C)$ is the intersection of the closed subspace $\mathcal{F}_C$, i.e. a linear manifold in $\mathcal{F}$, and $\Psi(X)$ (see Figure 2).

In the above interpretation, the estimator we propose for the support stems from the following observation: given a training set $\{x_1, \dots, x_n\}$, we classify a new point $x$ as belonging to the estimator $X_n$ of $X_\rho$ if the distance of $\Psi(x)$ to the linear span of $\{\Psi(x_1), \dots, \Psi(x_n)\}$ is sufficiently small.
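The distance in feature space can be computed from kernel evaluations alone. The Python sketch below is a minimal illustration under several assumptions: it uses the canonical feature map $\Psi(x) = K_x$, a one-dimensional Abel kernel with unit width, an invertible Gram matrix, and a plain Gaussian elimination solver. Under these assumptions the squared distance of $\Psi(x)$ to the span of $\Psi(x_1), \dots, \Psi(x_n)$ equals $K(x,x) - k_x^{\top} \mathbf{K}_n^{-1} k_x$, where $k_x = (K(x_i, x))_i$. All function names are hypothetical.

```python
import math

def abel_kernel(x, y, width=1.0):
    # Abel (Laplacian) kernel on the real line, normalized so that K(x, x) = 1.
    return math.exp(-abs(x - y) / width)

def solve(a, b):
    # Solve a z = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def squared_distance_to_span(x, train, width=1.0):
    # K(x, x) - k_x^T K_n^{-1} k_x, assuming the Gram matrix K_n is invertible.
    gram = [[abel_kernel(u, v, width) for v in train] for u in train]
    k_x = [abel_kernel(u, x, width) for u in train]
    coeffs = solve(gram, k_x)
    return abel_kernel(x, x, width) - sum(c * k for c, k in zip(coeffs, k_x))

train = [0.0, 0.5, 1.0]                             # hypothetical training set
d2_inside = squared_distance_to_span(0.5, train)    # a training point
d2_outside = squared_distance_to_span(3.0, train)   # a point far from the data
```

The distance vanishes at a training point and is close to 1, its largest possible value under the normalization $K(x,x) = 1$, for a point far from the data.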



6.3 Inverse Problems and Empirical Risk Minimization 

Here we suggest a simple interpretation of the estimator $F_n$ and stress the connection with the supervised setting. We regard the sampled data $x_1, \dots, x_n$ as a training set of positive examples, so that each point $x_i \in X_\rho$ almost surely; the new datum is the point $x \in X$, and we evaluate the estimator $F_n$ at $x$. We label the examples according to the similarity function $K$ by setting

\[ y_i(x) = K(x_i, x) = (\mathbf{K}_x)_i \qquad i = 1, \dots, n. \]

If $K$ satisfies Assumption 1, then, since $K(x, x) = 1$ and $K$ is $d_K$-continuous, the function $y_i$ is close to 1 whenever $x_i$ is close to $x$. The interpolation problem

\[ \text{find } f \in \mathcal{H} \text{ such that } f(x_i) = y_i(x) \quad \forall i \in \{1, \dots, n\} \iff S_n f = \mathbf{K}_x \]

(where $S_n$ is defined in (18)) is ill-posed. To restore well-posedness we can consider the corresponding least squares problem (empirical risk minimization problem)

\[ \min_{f\in\mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - y_i(x)|^2 \iff \min_{f\in\mathcal{H}} \frac{1}{n} \| S_n f - \mathbf{K}_x \|^2_{\mathbb{C}^n}, \]

or in fact its regularized version

\[ \min_{f\in\mathcal{H}} \left( \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - y_i(x)|^2 + \lambda \| f \|^2_{\mathcal{H}} \right) \iff \min_{f\in\mathcal{H}} \left( \frac{1}{n} \| S_n f - \mathbf{K}_x \|^2_{\mathbb{C}^n} + \lambda \| f \|^2_{\mathcal{H}} \right), \]

where $\lambda > 0$ is the regularization parameter (Tikhonov regularization). It is known [31] that the minimum of the above expression is achieved by $f = f_n^\lambda$, with

\[ f_n^\lambda = \frac{1}{n}\, g_\lambda\!\left( \frac{S_n^* S_n}{n} \right) S_n^*\, y, \qquad y = (y_1(x), \dots, y_n(x)) = \mathbf{K}_x, \]

where $g_\lambda$ is the function $g_\lambda(\sigma) = 1/(\sigma + \lambda)$.

Figure 2: The set $X$ and the support $X_\rho$ are mapped into the feature space $\mathcal{F}$ by the feature map $\Psi$. Here we take $\mathcal{F}_\rho = \mathcal{F}_{X_\rho}$ to be a linear space passing through the origin. The image of the support with respect to the feature map is given by the intersection of the image of $X$ with $\mathcal{F}_\rho$. By the separating property a point $x$ belongs to the support if and only if the distance between $\Psi(x)$ and $\mathcal{F}_\rho$ is zero.

More generally, Tikhonov regularization can be replaced by spectral regularization induced by a different choice of the filter $g_\lambda$; the corresponding regularized solution is still given by the previous equation, but the function $g_\lambda$ appearing in it is now completely arbitrary. Comparing with (19), we see that $f_n^\lambda(x) = F_n(x)$. Equation (17) has then the following interpretation: a new point $x$ is estimated to be a positive example (that is, to belong to the support $X_\rho$) if and only if $f_n^\lambda(x) \ge 1 - \tau$, where $\tau$ is a threshold parameter.
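The decision rule just described can be sketched in a few lines of Python. The code below assumes the matrix form $F_n(x) = \mathbf{K}_x^* (\mathbf{K}_n + n\lambda)^{-1} \mathbf{K}_x$ of the Tikhonov estimator, a one-dimensional Abel kernel with unit width, and a hand-written Cholesky solver; the training set, the values of $\lambda$ and $\tau$, and all function names are illustrative assumptions.

```python
import math

def abel_kernel(x, y, width=1.0):
    # Abel (Laplacian) kernel on the real line, with K(x, x) = 1.
    return math.exp(-abs(x - y) / width)

def cholesky_solve(a, b):
    # Solve a z = b for a symmetric positive definite matrix a.
    n = len(b)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution with the lower factor
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution with its transpose
        tail = sum(low[k][i] * z[k] for k in range(i + 1, n))
        z[i] = (y[i] - tail) / low[i][i]
    return z

def tikhonov_estimator(x, train, lam, width=1.0):
    # F_n(x) = K_x^T (K_n + n * lam * I)^{-1} K_x.
    n = len(train)
    gram = [[abel_kernel(u, v, width) + (n * lam if i == j else 0.0)
             for j, v in enumerate(train)] for i, u in enumerate(train)]
    k_x = [abel_kernel(u, x, width) for u in train]
    coeffs = cholesky_solve(gram, k_x)
    return sum(c * k for c, k in zip(coeffs, k_x))

train = [i / 10.0 for i in range(11)]   # hypothetical sample from [0, 1]
lam, tau = 1e-3, 0.1                    # hypothetical parameters
inside = tikhonov_estimator(0.5, train, lam)
outside = tikhonov_estimator(5.0, train, lam)
in_support = [value >= 1.0 - tau for value in (inside, outside)]
```

Here the coefficient vector returned by the solver plays the role of the regularized least squares solution with labels $y_i(x) = K(x_i, x)$, so evaluating it against the same labels gives the estimator value at $x$.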

The above discussion suggests several extensions and variations of our method, obtained by considering more general penalized empirical risk minimization functionals of the form

\[ \min_{f\in\mathcal{H}} \left( \frac{1}{n} \sum_{i=1}^{n} V(y_i(x), f(x_i)) + \lambda R(f) \right), \]

where:

• $V$ is a (regression) loss function measuring the approximation property of $f$, for example the logistic loss or a robust loss such as the one used in support vector machine regression. Our theoretical analysis does not carry over to other loss functions, and different mathematical concepts from empirical process theory are probably needed;

• $R$ is a regularizer measuring the complexity of a function $f \in \mathcal{H}$. For example, one can consider the case where the kernel is given by a dictionary of atoms $f_\gamma : X \to \mathbb{C}$, with $\gamma \in \Gamma$, such that $\sum_{\gamma\in\Gamma} |f_\gamma(x)|^2 = 1$, so that we have $K(x, y) = \sum_{\gamma\in\Gamma} f_\gamma(x)\, \overline{f_\gamma(y)}$ and, hence, $f = \sum_{\gamma\in\Gamma} w_\gamma f_\gamma$ with $w = (w_\gamma)_{\gamma\in\Gamma} \in \ell_2(\Gamma)$. In this setting, Tikhonov regularization corresponds to the choice $R(f) = \sum_{\gamma\in\Gamma} |w_\gamma|^2$, but other norms, such as the $\ell_1$ norm $\sum_{\gamma\in\Gamma} |w_\gamma|$, can also be considered.

            3vs8            8vs3            1vs7              9vs4            CBCL
Spectral    0.837 ± 0.006   0.783 ± 0.003   0.9921 ± 0.0005   0.865 ± 0.002   0.868 ± 0.002
Parzen      0.784 ± 0.007   0.766 ± 0.003   0.9811 ± 0.0003   0.724 ± 0.003   0.878 ± 0.002
1CSVM       0.790 ± 0.006   0.764 ± 0.003   0.9889 ± 0.0002   0.753 ± 0.004   0.882 ± 0.002

Table 2: Average and standard deviation of the AUC for the different estimators on the considered tasks.

7 Empirical Analysis 

In this section we describe some preliminary experiments aimed at testing the properties and the performance of the proposed methods both on simulated and real data. We only discuss spectral algorithms induced by Tikhonov regularization, to contrast the general method with some current state of the art algorithms. Note that while computations can be made more efficient in several ways, we consider a simple algorithmic protocol and leave a more refined computational study for future work.

Recall that Tikhonov regularization defines an estimator $F_n(x) = \mathbf{K}_x^* (\mathbf{K}_n + n\lambda)^{-1} \mathbf{K}_x$, and a point $x$ is labeled as belonging to the support $X_\rho$ if $F_n(x) \ge 1 - \tau$. The computational cost of the algorithm is, in the worst case, of order $n^3$ for training, like standard regularized least squares, and of order $N n^2$ if we have to predict the value of $F_n$ at $N$ test points. In practice, one has to choose a good value for the regularization parameter $\lambda$, and this requires computing multiple solutions, a so-called regularization path. As noted in [52], if we form the inverse using the eigendecomposition of the kernel matrix, the price of computing the full regularization path is essentially the same as that of computing a single solution (note that the cost of the eigendecomposition of $\mathbf{K}_n$ is also of order $n^3$, though the constant is worse). This is the strategy that we consider in the following.

In our experiments we considered two datasets: the MNIST⁴ dataset and the CBCL⁵ face database. For the digits we considered a reduced set consisting of a training set of 5000 images and a test set of 1000 images. In the first experiment we trained on 500 images of the digit 3 and tested on 200 images of digits 3 and 8. Each experiment consists of training on one class and testing on two different classes, and was repeated for 20 trials over different training set choices.

For all our experiments we considered the Abel kernel. Note that in this case the algorithm requires choosing 3 parameters: the regularization parameter $\lambda$, the kernel width $\sigma$ and the threshold $\tau$. In supervised learning cross validation is typically used for parameter tuning, but it cannot be used in our setting since support estimation is an unsupervised problem. We therefore considered the following heuristics. The kernel width is chosen as the median of the distribution of distances of the $k$-th nearest neighbor of each training set point, for $k = 10$. Once the kernel width is fixed, we choose the regularization parameter in correspondence of the maximum curvature in the eigenvalue behavior (see Figure 3), the rationale being that after this value the eigenvalues are relatively small.
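The kernel width heuristic described above can be sketched as follows. This is an illustrative Python implementation under the assumption that distances are Euclidean; the function name and the toy data are hypothetical, and the brute-force neighbor search is chosen for clarity rather than speed.

```python
import math
import statistics

def kth_neighbor_width(points, k=10):
    # Median, over the training points, of the distance to the k-th nearest neighbor.
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    kth_distances = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth_distances.append(dists[k - 1])
    return statistics.median(kth_distances)

# Hypothetical toy data: five equally spaced points on a line.
line = [(float(i), 0.0) for i in range(5)]
width_k1 = kth_neighbor_width(line, k=1)
width_k4 = kth_neighbor_width(line, k=4)
```

For larger training sets a spatial index would avoid the quadratic cost of the pairwise distance computation.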

For comparison we considered a Parzen window density estimator and one-class SVM (1CSVM) as implemented by [70]. For the Parzen window estimator we used the same kernel as for the spectral algorithm, that is the Laplacian kernel, and also the same width. Given a kernel width, an estimate of the probability distribution is computed and can be used to estimate the support by fixing a threshold $\tau'$. For the one-class SVM we considered the Gaussian kernel, so that we have to fix the kernel width and a regularization parameter $\nu$. We fixed the kernel width to be the same used by our estimator and set $\nu = 0.9$. For the sake of comparison, also for one-class SVM we considered a varying offset $\tau''$.

4 http://yann.lecun.com/exdb/mnist/
5 http://cbcl.mit.edu/

Figure 3: Decay of the eigenvalues of the kernel matrix ordered in decreasing magnitude and corresponding regularization parameter in logarithmic scale.

Figure 4: ROC curves for the different estimators in three different tasks: digit 9vs4 (Left), digit 1vs7 (Center), CBCL (Right).

The performance is evaluated by computing the ROC curve (and the corresponding AUC value) for varying values of the thresholds $\tau, \tau', \tau''$. The ROC curves on the different tasks are reported (for one of the trials) in Figure 4, Left. The mean and standard deviation of the AUC for the three methods are reported in Table 2. Similar experiments were repeated considering other pairs of digits, see Table 2. Also in the case of the CBCL dataset we considered a reduced dataset consisting of 472 images for training and another 472 for testing. On the different tests performed on the MNIST data the spectral algorithm always achieves results which are better, and often substantially better, than those of the other methods. On the CBCL dataset SVM provides the best result, but the spectral algorithm still provides a competitive performance.
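For readers who wish to reproduce this kind of evaluation, the AUC can be computed directly from the estimator scores without sweeping the threshold explicitly, using the standard fact that it equals the probability that a positive example receives a higher score than a negative one (ties counted as one half). The Python sketch below is a generic illustration with made-up scores; it is not the evaluation code used for Table 2.

```python
def auc(scores_pos, scores_neg):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores: values of an estimator on test points of the training
# class (positives) and of the other class (negatives).
perfect = auc([0.9, 0.8], [0.1, 0.2])
partial = auc([0.9, 0.4], [0.5, 0.1])
tied = auc([0.5], [0.5])
```

This pairwise count is quadratic in the number of test points; a rank-based formula gives the same value more efficiently on large test sets.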



A Auxiliary Proofs 

In this section we give the proofs of some technical results needed in the paper. 
A.1 Analytic Results

In this section, we suppose that the kernel $K$ satisfies Assumption 1, and endow the set $X$ with the metric $d_K$ induced by $K$. The next simple lemma will be used frequently.

Lemma 3. For all $k = 1, 2$, the map

\[ \xi : X \to S_k, \qquad \xi(x) = K_x \otimes K_x \]

is continuous and measurable. Moreover, if $Z_i : \Omega \to S_k$ is given by

\[ Z_i(\omega) = K_{x_i} \otimes K_{x_i}, \qquad \omega = (x_j)_{j\ge 1}, \]

then $Z_i$ is measurable for all $i \ge 1$.

Proof. The map $\Phi : X \to \mathcal{H}$, with $\Phi(x) = K_x$, is continuous by item i) in Proposition 1. Since $\xi(x) = \Phi(x) \otimes \Phi(x)$, continuity of $\xi$ follows at once. By item v) in Proposition 1, $\xi$ is then a measurable map, hence $Z_i$ is such. □

We recall some basic properties of the operator $T$ defined by the kernel. The next result is known (see for example [25]), but we report a short proof for completeness.

Proposition 12. The $S_1$-valued map $\xi$ defined in Lemma 3 is Bochner-integrable with respect to $\rho$, and its integral

\[ T = \int_X K_x \otimes K_x \, d\rho(x) \]

is a positive trace class operator on $\mathcal{H}$, with $\| T \|_{S_1} = \operatorname{tr}[T] = 1$.

Proof. The map $\xi$ is bounded because $\| K_x \otimes K_x \|_{S_1} = \operatorname{tr}[K_x \otimes K_x] = K(x, x) = 1$, and measurable by Lemma 3. Therefore, $\xi$ is a Bochner-integrable $S_1$-valued map, and its integral $T$ is a trace class operator. As $K_x \otimes K_x$ is a positive operator for all $x$, so is $T$. In particular, $\| T \|_{S_1} = \operatorname{tr}[T]$, and $\operatorname{tr}[T] = \int_X \operatorname{tr}[K_x \otimes K_x] \, d\rho(x) = 1$. □

Now, we come to the proof of Proposition 7. We will split it into the proofs of Propositions 13 and 14 below.

Lemma 4. For all $k = 1, 2$, the map

\[ f_n : X^n \to S_k, \qquad f_n(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} K_{x_i} \otimes K_{x_i} \]

is continuous and measurable.

Proof. Evident by Lemma 3. □

Proposition 13. For all $n \ge 1$, the map $T_n$ defined in (15) is an $S_k$-valued estimator for $k = 1, 2$.

Proof. We have

\[ T_n(\omega) = f_n(x_1, \dots, x_n), \qquad \omega = (x_i)_{i\ge 1}, \]

hence $T_n$ is measurable by Lemma 4. □

For the next proposition we recall that the topology of uniform convergence on compact subsets of $X$ is generated by the following basis of open sets $U_{f,\epsilon,C} \subset C(X)$:

\[ U_{f,\epsilon,C} = \left\{ g \in C(X) \ \Big|\ \sup_{x\in C} |f(x) - g(x)| < \epsilon \right\}, \qquad f \in C(X),\ \epsilon > 0,\ C \subset X \text{ compact}. \]

Proposition 14. Suppose $X$ is locally compact. Let $(r_\lambda)_{\lambda>0}$ be a family of functions $r_\lambda : [0, 1] \to [0, 1]$ such that each $r_\lambda$ is upper semicontinuous. Then, for any sequence of positive numbers $(\lambda_n)_{n\ge 1}$ and all $n \ge 1$, the map $F_n$ defined in (16) is a $C(X)$-valued estimator, where $C(X)$ is the space of continuous functions on $X$ with the topology of uniform convergence on compact subsets.

Proof. Throughout the proof, $n \ge 1$ will be fixed. Let $(\varphi_k)_{k\ge 1}$ be a decreasing sequence of continuous functions $\varphi_k : [0, 1] \to [0, 1]$ such that $\varphi_k(\sigma) \downarrow r_{\lambda_n}(\sigma)$ for all $\sigma \in [0, 1]$ (such a sequence exists by (12.7.8) of [29]). Then, by Lemma 4 and continuity of the functional calculus (see e.g. Problem 126 in [33]), for all $k \ge 1$ the map

\[ \varphi_k(T_n) : X^n \to S_\infty, \qquad [\varphi_k(T_n)](x_1, \dots, x_n) = \varphi_k(T_n(x_1, \dots, x_n)) \]

is continuous from $X^n$ into the Banach space $S_\infty$ of the bounded operators on $\mathcal{H}$ with the uniform operator norm. Thus, for all $x \in X$, the real function $(x_1, \dots, x_n) \mapsto \langle [\varphi_k(T_n)](x_1, \dots, x_n) K_x, K_x \rangle$ is continuous on $X^n$, hence is measurable by item v) of Proposition 1. By spectral calculus and the dominated convergence theorem, for all $\omega = (x_i)_{i\ge 1}$

\[ \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle = \langle r_{\lambda_n}(f_n(x_1, \dots, x_n)) K_x, K_x \rangle = \lim_{k\to\infty} \langle [\varphi_k(f_n)](x_1, \dots, x_n) K_x, K_x \rangle. \]

It then follows that, for each $x \in X$, the real function $\omega \mapsto \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle$ is measurable on $\Omega$, being the pointwise limit of measurable functions.

We now prove that the map $F_n : \omega \mapsto (x \mapsto \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle)$ is measurable from $\Omega$ into the space $C(X)$. Since $X$ is locally compact and second countable, the topology of uniform convergence on compact subsets is a separable metric topology on $C(X)$, hence it is enough to show that the inverse images of all open sets of $C(X)$ are measurable. For

\[ U_{f,\epsilon,C} = \left\{ g \in C(X) \ \Big|\ \sup_{x\in C} |f(x) - g(x)| < \epsilon \right\}, \qquad f \in C(X),\ \epsilon > 0,\ C \subset X \text{ compact}, \]

we have

\[ F_n^{-1}(U_{f,\epsilon,C}) = \left\{ \omega \in \Omega \ \Big|\ \sup_{x\in C} |f(x) - \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle| < \epsilon \right\}. \]

By separability of $X$, there exists a countable set $C_0 \subset C$ such that $\overline{C_0} = C$. A continuity argument then shows that

\[ F_n^{-1}(U_{f,\epsilon,C}) = \bigcup_{k\ge 1} \left\{ \omega \in \Omega \ \Big|\ \sup_{x\in C} |f(x) - \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle| < \epsilon - \frac{1}{k} \right\} \]
\[ = \bigcup_{k\ge 1} \bigcap_{x\in C_0} \left\{ \omega \in \Omega \ \Big|\ |f(x) - \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle| < \epsilon - \frac{1}{k} \right\}. \]

Since each set $\{ \omega \in \Omega \mid |f(x) - \langle r_{\lambda_n}(T_n(\omega)) K_x, K_x \rangle| < \epsilon - 1/k \}$ is measurable in $\Omega$, measurability of $F_n^{-1}(U_{f,\epsilon,C})$, which is a countable union of countable intersections of such sets, then follows. □

A.2 A useful inequality 

The following proof of inequality (39) below is due to A. Maurer.⁶

Lemma 5. Suppose $S$ and $T$ are two self-adjoint Hilbert-Schmidt operators on $\mathcal{H}$ with spectrum contained in the interval $[a, b]$, and let $(\sigma_j)_{j\in J}$ and $(\tau_k)_{k\in K}$ be the eigenvalues of $S$ and $T$, respectively. Given a function $r : [a, b] \to \mathbb{R}$, if the constant

\[ L = \sup_{j\in J,\, k\in K} \left| \frac{r(\sigma_j) - r(\tau_k)}{\sigma_j - \tau_k} \right| \qquad (\text{with } 0/0 = 0) \]

is finite, then

\[ \| r(S) - r(T) \|_{S_2} \le L\, \| S - T \|_{S_2}. \tag{38} \]

In particular, if $r$ is a Lipschitz function with Lipschitz constant $L_r$, then

\[ \| r(S) - r(T) \|_{S_2} \le L_r\, \| S - T \|_{S_2}. \tag{39} \]

6 http://www.andreas-maurer.eu

Proof. Let $(f_j)_{j\in J}$ and $(g_k)_{k\in K}$ be the orthonormal bases of eigenvectors of $S$ and $T$ corresponding to the eigenvalues $(\sigma_j)_{j\in J}$ and $(\tau_k)_{k\in K}$, respectively, which here we list repeated according to their multiplicity. We have

\[ \| r(S) - r(T) \|^2_{S_2} = \sum_{j,k} \left| \langle (r(S) - r(T)) f_j, g_k \rangle \right|^2 = \sum_{j,k} (r(\sigma_j) - r(\tau_k))^2\, |\langle f_j, g_k \rangle|^2 \]
\[ \le L^2 \sum_{j,k} (\sigma_j - \tau_k)^2\, |\langle f_j, g_k \rangle|^2 = L^2 \sum_{j,k} \left| \langle (S - T) f_j, g_k \rangle \right|^2 = L^2\, \| S - T \|^2_{S_2}, \]

which is (38). □



A.3 Concentration of Measure Results 

We will use the following standard concentration inequality for Hilbert space random variables (see Theorem 8.6 in [48], and [49]). Let $V$ be a separable Hilbert space and $(\Omega, \mathcal{A}_\Omega, P)$ a probability space. Suppose that $Y_1, Y_2, \dots$ is a sequence of independent zero-mean $V$-valued random variables $Y_i : \Omega \to V$. If $E[\| Y_i \|^m_V] \le (1/2)\, m!\, B^2 L^{m-2}$ for all $m \ge 2$, then, for all $n \ge 1$ and $\epsilon > 0$,

\[ P\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} Y_i \right\|_V > \epsilon \right] \le 2 \exp\left( - \frac{n \epsilon^2}{B^2 + L\epsilon + B\sqrt{B^2 + 2L\epsilon}} \right). \tag{40} \]

We will need in particular the next two straightforward consequences of this inequality.



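Lemma 6. If $Z_1, Z_2, \dots$ is a sequence of i.i.d. $V$-valued random variables, such that $\| Z_i \|_V \le M$ almost surely, $E[Z_i] = \mu$ and $E[\| Z_i \|^2_V] \le \sigma^2$ for all $i$, then, for all $n \ge 1$ and $\delta > 0$,

\[ \left\| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mu \right\|_V \le \frac{M\delta}{n} + \sqrt{\frac{2\sigma^2\delta}{n}} \tag{41} \]

with probability at least $1 - 2e^{-\delta}$.

Proof. Let $Y_i = Z_i - \mu$. Then $\| Y_i \|_V \le 2M$ and $E[\| Y_i \|^2_V] \le E[\| Z_i \|^2_V] \le \sigma^2$. Moreover, for all $i$ and $m \ge 2$, $E[\| Y_i \|^m_V] \le \sigma^2 (2M)^{m-2} \le (1/2)\, m!\, \sigma^2 M^{m-2}$, where the last inequality follows since $2^{m-2} \le m!/2$. Then, by (40),

\[ P\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mu \right\|_V > \epsilon \right] \le 2 \exp\left( - \frac{n\epsilon^2}{\sigma^2 + M\epsilon + \sigma\sqrt{\sigma^2 + 2M\epsilon}} \right) = 2 \exp\left( - \frac{n\sigma^2}{M^2}\, g\!\left( \frac{M\epsilon}{\sigma^2} \right) \right), \]

where $g(t) = t^2/(1 + t + \sqrt{1 + 2t})$. Since $g^{-1}(t) = t + \sqrt{2t}$, by solving the equation $(\sigma^2 n/M^2)\, g(M\epsilon/\sigma^2) = \delta$ we have

\[ \epsilon = \frac{\sigma^2}{M} \left( \frac{M^2\delta}{n\sigma^2} + \sqrt{\frac{2M^2\delta}{n\sigma^2}} \right) = \frac{M\delta}{n} + \sqrt{\frac{2\sigma^2\delta}{n}}. \]

□

The above result and the Borel-Cantelli lemma imply that

\[ \lim_{n\to\infty} \left\| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mu \right\|_V = 0 \]

almost surely. In the paper we actually need a slightly stronger result which is given in the following lemma.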
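Lemma 7. If $Z_1, Z_2, \dots$ is a sequence of i.i.d. $V$-valued random variables, such that $\| Z_i \|_V \le M$ almost surely, then we have

\[ \lim_{n\to\infty} \frac{\sqrt{n}}{\log n} \left\| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mu \right\|_V = 0 \]

almost surely.

Proof. We continue with the notations in the proof of Lemma 6. By (40), for all $\epsilon > 0$ we have

\[ P\left[ \frac{\sqrt{n}}{\log n} \left\| \frac{1}{n} \sum_{i=1}^{n} Y_i \right\|_V > \epsilon \right] = P\left[ \left\| \frac{1}{n} \sum_{i=1}^{n} Y_i \right\|_V > \frac{\epsilon \log n}{\sqrt{n}} \right] \le 2 e^{-A(n,\epsilon)} = 2 \left( \frac{1}{n} \right)^{\frac{A(n,\epsilon)}{\log n}}, \]

with

\[ A(n, \epsilon) = \frac{\epsilon^2 \log^2 n}{\sigma^2 + M\epsilon \frac{\log n}{\sqrt{n}} + \sigma \sqrt{\sigma^2 + 2M\epsilon \frac{\log n}{\sqrt{n}}}}. \]

It follows that

\[ \sum_{n\ge 1} P\left[ \frac{\sqrt{n}}{\log n} \left\| \frac{1}{n} \sum_{i=1}^{n} Y_i \right\|_V > \epsilon \right] \le 2 \sum_{n\ge 1} \left( \frac{1}{n} \right)^{\frac{A(n,\epsilon)}{\log n}}. \]

For all $\epsilon > 0$, $\lim_{n\to\infty} A(n, \epsilon)/\log n = +\infty$, so that the series $\sum_{n\ge 1} n^{-A(n,\epsilon)/\log n}$ is convergent, and the Borel-Cantelli lemma gives the result. □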

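The following inequality is given in [12] and we report its proof for completeness.

Lemma 8. If Assumption 1 holds true, then for all $\delta > 0$ we have

\[ \left\| (T + \lambda)^{-1} (T - T_n) \right\|_{S_2} \le \frac{\delta}{n\lambda} + \sqrt{\frac{2\delta\, \mathcal{N}(\lambda)}{n\lambda}} \]

with probability at least $1 - 2e^{-\delta}$.

Proof. Let $(\Omega, \mathcal{A}_\Omega, P)$ be the probability space defined at the beginning of Section 5.1. For all $i \ge 1$ we define the random variable $Y_i : \Omega \to S_2$ as

\[ Y_i(\omega) = (T + \lambda)^{-1} (K_{x_i} \otimes K_{x_i}), \qquad \omega = (x_j)_{j\ge 1}, \]

which is measurable by Lemma 3. Then, we have $\| Y_i \|_{S_2} \le 1/\lambda$ almost surely, $E[Y_i] = (T + \lambda)^{-1} T$, $(1/n) \sum_{i=1}^{n} Y_i = (T + \lambda)^{-1} T_n$ and

\[ E\left[ \| Y_i \|^2_{S_2} \right] = \int_\Omega \operatorname{tr}[Y_i(\omega)^* Y_i(\omega)] \, dP(\omega) = \int_X \operatorname{tr}\left[ (T + \lambda)^{-2} (K_x \otimes K_x) \right] d\rho(x) \]
\[ = \operatorname{tr}\left[ (T + \lambda)^{-2} T \right] \le \| (T + \lambda)^{-1} \|_\infty \operatorname{tr}\left[ (T + \lambda)^{-1} T \right] \le \frac{\mathcal{N}(\lambda)}{\lambda}, \]

where we have bounded the operator norm $\| (T + \lambda)^{-1} \|_\infty$ by $1/\lambda$. The result follows by applying Lemma 6. □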